[00:00:10] ori: it's short for cassandra [00:00:16] oh :) [00:00:34] I thought you were onto some next-gen programming language, a successor to C# [00:00:46] hehe [00:00:49] C+++ [00:01:45] it's quite common in the cassandra community [00:01:55] ACKNOWLEDGEMENT - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-tools/snapshot is not accessible: Permission denied Coren The issue is in the test itself. Still trying to sort it out. [00:02:08] urandom probably knows where it comes from [00:04:04] what's with the DCC CHAT gwicke? [00:05:07] Krenair: A/S/L? [00:06:14] PROBLEM - Persistent high iowait on labstore2001 is CRITICAL 71.43% of data above the critical threshold [35.0] [00:06:43] Hm. Yes, so those triggers are too low for general use because backups. [00:06:54] * Coren wags finger at icinga [00:09:19] Coren: yes. [00:11:50] (03PS1) 10Yuvipanda: labstore: Fixup start-nfs for new storage layout [puppet] - 10https://gerrit.wikimedia.org/r/228179 (https://phabricator.wikimedia.org/T106590) [00:12:14] RECOVERY - Persistent high iowait on labstore2001 is OK Less than 50.00% above the threshold [25.0] [00:12:17] coren ^ as well [00:12:24] (the patch, not the alert) [00:12:43] (03PS1) 10coren: labstore: tweak alerting thresholds [puppet] - 10https://gerrit.wikimedia.org/r/228180 [00:13:26] (03CR) 10Yuvipanda: [C: 031] labstore: tweak alerting thresholds [puppet] - 10https://gerrit.wikimedia.org/r/228180 (owner: 10coren) [00:14:01] YuviPanda: I was about to ask whether my commit message was clear enough. [00:15:05] coren yup, is good enough [00:15:54] RECOVERY - puppet last run on ruthenium is OK Puppet is currently enabled, last run 48 seconds ago with 0 failures [00:15:54] (03CR) 10coren: "The replacement for nfs-project-exports now starts at boot? Are we certain that it handles filesystems not being mounted yet properly?" 
[puppet] - https://gerrit.wikimedia.org/r/228179 (https://phabricator.wikimedia.org/T106590) (owner: Yuvipanda) [00:16:27] (CR) coren: [C: 2] labstore: tweak alerting thresholds [puppet] - https://gerrit.wikimedia.org/r/228180 (owner: coren) [00:16:33] (PS2) coren: labstore: tweak alerting thresholds [puppet] - https://gerrit.wikimedia.org/r/228180 [00:16:55] coren it's nfs-exports-daemon and yes it starts at startup if it's on the 'active' labstore host [00:17:10] coren why would it care if things are mounted or not? [00:17:12] Will it cope with the interval between boot and start-nfs? [00:17:54] (CR) BryanDavis: "I don't know this particular hiera magic, but if it ends up making an array of [logstash1001, logstash1002, logstash1003, logstash1004, lo" [puppet] - https://gerrit.wikimedia.org/r/227960 (https://phabricator.wikimedia.org/T104964) (owner: Muehlenhoff) [00:18:00] Because it attempts to export directories and filesystems that aren't mounted and if it does so then it will completely mess up the fsids (cause 'stale filehandle's on the client) if it doesn't wait for the mounts [00:18:31] coren so exportfs -ra will cause problems if things aren't mounted? [00:18:59] Well, it'll export bits of / instead of the right filesystems so wrong inodes and devices, etc. [00:19:16] That's why starting it as part of start-nfs [00:19:25] (after the mounts) [00:19:42] s/ as part / was part / [00:20:59] That said, simply not making it start at boot and doing a systemctl start would work just as well as service start did [00:21:22] coren indeed. [00:21:46] coren actually, it doesn't start on boot - it starts on puppet run! [00:21:59] because I have the systemd unit not have an install directive [00:22:05] but there's a service directive in puppet that'll start it [00:22:48] Ah. Well, the net effect is the same since puppet is run at boot anyways. 
:-) [00:24:14] coren indeed [00:24:24] coren ok so I'll make it not run via puppet and add a call to start-nfs [00:24:26] 6operations, 10RESTBase, 10Traffic: Restbase insecure POST requests to MW api.php - https://phabricator.wikimedia.org/T107030#1497430 (10GWicke) https://github.com/wikimedia/restbase/pull/288 to address this was now merged, but is not deployed yet. It's part of a larger deploy, and will require matching conf... [00:25:30] (03CR) 10Krinkle: [C: 032] rl-test: Fix IP detection to use WebRequest::getIP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228173 (https://phabricator.wikimedia.org/T105255) (owner: 10Krinkle) [00:26:00] (03Merged) 10jenkins-bot: rl-test: Fix IP detection to use WebRequest::getIP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228173 (https://phabricator.wikimedia.org/T105255) (owner: 10Krinkle) [00:29:42] !log catrope Synchronized php-1.26wmf16/extensions/Flow/includes/Model/WikiReference.php: debugging (duration: 00m 13s) [00:29:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:31:38] 6operations, 6Services, 10Traffic: Provide an API listing at /api/ - https://phabricator.wikimedia.org/T107086#1497436 (10GWicke) @spage, do you have the right to edit protected pages on meta? [00:34:21] !log catrope Synchronized php-1.26wmf16/extensions/Flow/includes/Model/WikiReference.php: debugging (duration: 00m 12s) [00:34:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:35:49] !log catrope Synchronized php-1.26wmf16/extensions/Flow/includes/Model/WikiReference.php: debugging (duration: 00m 12s) [00:43:52] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 5Patch-For-Review: Build new latest stable (0.8.2.1?) 
Kafka package and upgrade Kafka brokers - https://phabricator.wikimedia.org/T106581#1497466 (Ottomata) Working on migration plan: https://etherpad.wikimedia.org/p/kafka_0.8.2.1_migration [00:44:58] hmm, I seem to no longer have access to terbium. Who is a good person to ask about this? [00:46:39] kaldari: odd, you're still in the deployment group in puppet. [00:47:12] maybe springle? [00:47:36] kaldari@bast1001:~$ ssh terbium.eqiad.wmnet [00:47:36] Permission denied (publickey). [00:48:32] mutante? [00:48:38] kaldari: are you still using agent forwarding? [00:48:39] kaldari: oh, are you sshing from bast1001 to terbium? IIRC agent forwarding has been disabled [00:48:55] kaldari: you might need to use ProxyCommand [00:49:06] ebernhardson: OK, I'll try that [00:50:29] gwicke, ori: it comes from people too lazy to type cassandra; it comes from java devs who are used to using acronyms for everything [00:51:00] kaldari, https://wikitech.wikimedia.org/wiki/SSH_access#Production should work [00:51:59] Krenair: Thanks, it's probably because I'm on a new machine and copied over the keys but not the config :) [00:52:07] yeah [00:52:08] urandom: :) [00:52:19] oh, right [00:52:20] yeah [00:52:29] you definitely can't ssh from bast1001 like that [01:03:24] (PS2) Gergő Tisza: [WIP] Add sentry-phabricator package [software/sentry] - https://gerrit.wikimedia.org/r/227931 (https://phabricator.wikimedia.org/T97136) [01:05:18] (CR) Gergő Tisza: "Sure. This is just something to play with while I wait for review on https://gerrit.wikimedia.org/r/#/c/199598/ . 
It does not seem particu" [software/sentry] - 10https://gerrit.wikimedia.org/r/227931 (https://phabricator.wikimedia.org/T97136) (owner: 10Gergő Tisza) [01:09:19] PROBLEM - IPsec on cp4020 is CRITICAL: Strongswan CRITICAL - 15 ESP transports installed, 1 problems (kernel-state-missing: 1) [01:09:19] PROBLEM - IPsec on cp4019 is CRITICAL: Strongswan CRITICAL - 15 ESP transports installed, 1 problems (kernel-state-missing: 1) [01:09:49] PROBLEM - IPsec on cp4012 is CRITICAL: Strongswan CRITICAL - 15 ESP transports installed, 1 problems (kernel-state-missing: 1) [01:09:50] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - 15 ESP transports installed, 1 problems (kernel-state-missing: 1) [01:09:50] PROBLEM - IPsec on cp3017 is CRITICAL: Strongswan CRITICAL - 15 ESP transports installed, 1 problems (kernel-state-missing: 1) [01:10:00] PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - 8 ESP transports installed, 8 problems (kernel-state-missing: 8) [01:10:30] PROBLEM - IPsec on cp3018 is CRITICAL: Strongswan CRITICAL - 15 ESP transports installed, 1 problems (kernel-state-missing: 1) [01:10:30] PROBLEM - IPsec on cp3015 is CRITICAL: Strongswan CRITICAL - 15 ESP transports installed, 1 problems (kernel-state-missing: 1) [01:10:30] PROBLEM - IPsec on cp3016 is CRITICAL: Strongswan CRITICAL - 15 ESP transports installed, 1 problems (kernel-state-missing: 1) [01:31:49] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - 31 ESP transports installed, 1 problems (kernel-state-missing: 1) [01:31:59] PROBLEM - IPsec on cp1055 is CRITICAL: Strongswan CRITICAL - 41 ESP transports installed, 1 problems (kernel-state-missing: 1) [01:32:41] bblack: ^ ? 
[01:46:09] PROBLEM - IPsec on cp1059 is CRITICAL: Strongswan CRITICAL - 11 ESP transports installed, 5 problems (kernel-state-missing: 5) [01:59:49] (03PS1) 10Mattflaschen: Convert wmgLiquidThreadsBackfill to wmgLiquidThreadsFrozen [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228192 (https://phabricator.wikimedia.org/T107068) [02:00:09] (03CR) 10Mattflaschen: [C: 04-2] "Do not deploy without coordination with Collaboration team." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228192 (https://phabricator.wikimedia.org/T107068) (owner: 10Mattflaschen) [02:01:40] PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - 9 ESP transports installed, 7 problems (kernel-state-missing: 7) [02:19:50] (03PS6) 10Wpmirrordev: Extend maximum allowed mediawiki version to 1.26 [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/171976 [02:20:46] 6operations, 10Wikimedia-Site-requests: Run "refreshLinks.php --dfn-only" on all wikis periodically - https://phabricator.wikimedia.org/T18112#1497582 (10Krenair) So... Guess we're just missing wikitech here? Or does that not count? 
:) [02:23:40] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - 31 ESP transports installed, 1 problems (kernel-state-missing: 1) [02:23:40] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - 31 ESP transports installed, 1 problems (kernel-state-missing: 1) [02:23:40] PROBLEM - IPsec on cp4018 is CRITICAL: Strongswan CRITICAL - 31 ESP transports installed, 1 problems (kernel-state-missing: 1) [02:23:41] PROBLEM - IPsec on cp4008 is CRITICAL: Strongswan CRITICAL - 31 ESP transports installed, 1 problems (kernel-state-missing: 1) [02:23:59] PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - 31 ESP transports installed, 1 problems (kernel-state-missing: 1) [02:24:20] PROBLEM - IPsec on cp3006 is CRITICAL: Strongswan CRITICAL - 31 ESP transports installed, 1 problems (kernel-state-missing: 1) [02:24:20] PROBLEM - IPsec on cp3004 is CRITICAL: Strongswan CRITICAL - 31 ESP transports installed, 1 problems (kernel-state-missing: 1) [02:24:20] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - 31 ESP transports installed, 1 problems (kernel-state-missing: 1) [02:24:40] PROBLEM - IPsec on cp3031 is CRITICAL: Strongswan CRITICAL - 31 ESP transports installed, 1 problems (kernel-state-missing: 1) [02:24:49] PROBLEM - IPsec on cp3009 is CRITICAL: Strongswan CRITICAL - 31 ESP transports installed, 1 problems (kernel-state-missing: 1) [02:24:49] PROBLEM - IPsec on cp3014 is CRITICAL: Strongswan CRITICAL - 31 ESP transports installed, 1 problems (kernel-state-missing: 1) [02:24:49] PROBLEM - IPsec on cp1067 is CRITICAL: Strongswan CRITICAL - 27 ESP transports installed, 15 problems (kernel-state-missing: 15) [02:24:51] PROBLEM - IPsec on cp4010 is CRITICAL: Strongswan CRITICAL - 31 ESP transports installed, 1 problems (kernel-state-missing: 1) [02:25:10] PROBLEM - IPsec on cp4017 is CRITICAL: Strongswan CRITICAL - 31 ESP transports installed, 1 problems (kernel-state-missing: 1) [02:25:19] PROBLEM - IPsec on cp3003 is CRITICAL: Strongswan 
CRITICAL - 31 ESP transports installed, 1 problems (kernel-state-missing: 1) [02:28:17] !log l10nupdate Synchronized php-1.26wmf16/cache/l10n: (no message) (duration: 06m 13s) [02:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:28:59] 6operations, 10Wikimedia-Site-requests: Run "refreshLinks.php --dfn-only" on all wikis periodically - https://phabricator.wikimedia.org/T18112#1497596 (10Krenair) 5Open>3Resolved a:3Krenair I guess maintenance scripts on wikitech/silver is a larger issue, I'll open a separate ticket. [02:30:31] RECOVERY - IPsec on cp3006 is OK: Strongswan OK - Security Associations: 32 ESP transports installed [02:31:21] !log @tin LocalisationUpdate completed (1.26wmf16) at 2015-07-31 02:31:20+00:00 [02:31:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:32:43] 6operations, 6Commons, 10MediaWiki-Special-pages, 5MW-1.26-release, and 5 others: MIMEsearchPage::reallyDoQuery failing on the logs due to taking too long to query - https://phabricator.wikimedia.org/T107265#1497604 (10MZMcBride) >>! In T107265#1497228, @ori wrote: > This page is now marked as expensive, s... 
[02:34:50] RECOVERY - IPsec on cp3031 is OK: Strongswan OK - Security Associations: 32 ESP transports installed [02:36:59] RECOVERY - IPsec on cp3009 is OK: Strongswan OK - Security Associations: 32 ESP transports installed [02:37:50] RECOVERY - IPsec on cp4018 is OK: Strongswan OK - Security Associations: 32 ESP transports installed [02:39:30] RECOVERY - IPsec on cp3003 is OK: Strongswan OK - Security Associations: 32 ESP transports installed [02:41:20] RECOVERY - IPsec on cp4017 is OK: Strongswan OK - Security Associations: 32 ESP transports installed [02:42:30] RECOVERY - IPsec on cp3007 is OK: Strongswan OK - Security Associations: 32 ESP transports installed [02:42:55] 6operations, 6Labs, 10wikitech.wikimedia.org: Figure out what to do about maintenance scripts on silver/wikitech - https://phabricator.wikimedia.org/T107547#1497611 (10Krenair) 3NEW [02:43:10] RECOVERY - IPsec on cp4020 is OK: Strongswan OK - Security Associations: 16 ESP transports installed [02:47:23] 6operations, 6Labs, 10wikitech.wikimedia.org: Figure out what to do about maintenance scripts on silver/wikitech - https://phabricator.wikimedia.org/T107547#1497625 (10Krenair) [02:48:59] RECOVERY - IPsec on cp3014 is OK: Strongswan OK - Security Associations: 32 ESP transports installed [02:52:00] RECOVERY - IPsec on cp3041 is OK: Strongswan OK - Security Associations: 32 ESP transports installed [02:57:10] PROBLEM - IPsec on cp1066 is CRITICAL: Strongswan CRITICAL - 41 ESP transports installed, 1 problems (not-connected: 1) [02:59:10] RECOVERY - IPsec on cp1066 is OK: Strongswan OK - Security Associations: 42 ESP transports installed [02:59:49] RECOVERY - IPsec on cp4008 is OK: Strongswan OK - Security Associations: 32 ESP transports installed [03:04:29] well that's fascinating :P [03:04:47] bd808, trying to debug something on tin... why does eval.php still work if I stick syntax errors in commonsettings? 
I can't get it to take into account any change I make [03:05:09] Krenair: eval works from -staging [03:05:11] RECOVERY - IPsec on cp4010 is OK: Strongswan OK - Security Associations: 32 ESP transports installed [03:05:16] unless someone broke that... [03:05:16] yep [03:05:19] that's what I'm changing [03:05:35] in any case, it seems to have failed soft as intended. unencrypted traffic continues to flow. [03:07:33] hoo, if I stick errors in InitialiseSettings it does actually error out [03:07:37] but not commonsettings [03:09:09] very weird [03:09:16] try stracing it, maybe? [03:11:49] RECOVERY - IPsec on cp1060 is OK: Strongswan OK - Security Associations: 16 ESP transports installed [03:12:00] RECOVERY - IPsec on cp3010 is OK: Strongswan OK - Security Associations: 32 ESP transports installed [03:12:00] RECOVERY - IPsec on cp3030 is OK: Strongswan OK - Security Associations: 32 ESP transports installed [03:12:00] RECOVERY - IPsec on cp3017 is OK: Strongswan OK - Security Associations: 16 ESP transports installed [03:12:40] RECOVERY - IPsec on cp3016 is OK: Strongswan OK - Security Associations: 16 ESP transports installed [03:12:40] RECOVERY - IPsec on cp3004 is OK: Strongswan OK - Security Associations: 32 ESP transports installed [03:12:40] RECOVERY - IPsec on cp3018 is OK: Strongswan OK - Security Associations: 16 ESP transports installed [03:12:50] RECOVERY - IPsec on cp1067 is OK: Strongswan OK - Security Associations: 42 ESP transports installed [03:14:14] hoo, how do I actually use strace with mwscript? 
[03:15:41] RECOVERY - IPsec on cp1055 is OK: Strongswan OK - Security Associations: 42 ESP transports installed [03:15:50] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - Security Associations: 16 ESP transports installed [03:15:50] RECOVERY - IPsec on cp4012 is OK: Strongswan OK - Security Associations: 16 ESP transports installed [03:16:00] RECOVERY - IPsec on cp3008 is OK: Strongswan OK - Security Associations: 32 ESP transports installed [03:16:01] RECOVERY - IPsec on cp1059 is OK: Strongswan OK - Security Associations: 16 ESP transports installed [03:16:39] RECOVERY - IPsec on cp3015 is OK: Strongswan OK - Security Associations: 16 ESP transports installed [03:17:00] Krenair: Just strace it? [03:17:10] RECOVERY - IPsec on cp4019 is OK: Strongswan OK - Security Associations: 16 ESP transports installed [03:17:30] RECOVERY - IPsec on cp1047 is OK: Strongswan OK - Security Associations: 16 ESP transports installed [03:17:42] [pid 25831] lstat("/srv/mediawiki/wmf-config/CommonSettings.php", {st_mode=S_IFREG|0644, st_size=103451, ...}) = 0 [03:17:42] [pid 25831] lstat("/srv/mediawiki/wmf-config", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 [03:17:42] [pid 25831] open("/srv/mediawiki/wmf-config/CommonSettings.php", O_RDONLY) = 3 [03:18:06] hoo, nothing useful there [03:18:11] Running sudo -u www-data strace -f mwscript showJobs.php --wiki testwiki but the wiki shouldn't matter [03:18:25] it barely outputs anything compared to when I run it with eval.php locally [03:18:42] you need to make it follow (-f) [03:19:29] Oh and you obviously need to run it under either root or the user you want to strace, thus sudo first [03:19:50] right, that looks much more useful [03:21:59] okay, it's because php-1.26wmf16/LocalSettings.php just goes straight for /srv/mediawiki/wmf-config/InitialiseSettings.php, completely ignoring mediawiki-staging. 
helpful [03:28:00] legoktm, help [03:28:09] hi [03:28:17] the visualeditor namespace config is completely broken [03:28:33] uhh [03:28:44] define broken? [03:29:03] VE is enabled on some talk namespaces. [03:29:21] what specific wiki? [03:29:21] namespace IDs 0 to 7 on enwiki [03:29:28] and probably all the other wikis [03:29:30] ok [03:30:15] umm [03:30:23] Krenair: have you messed with something on tin? [03:30:31] mwscript eval.php --wiki=enwiki just var dumped stuff [03:30:53] yes [03:30:56] see git diff [03:31:16] I have the config generating a sane-looking wgVisualEditorAvailableNamespaces array [03:31:26] and it dumps the result and it's fine [03:32:06] but then stick in var_dump( $wgVisualEditorAvailableNamespaces ); [03:33:03] I don't know where that comes from. [03:33:20] so it works fine during config [03:34:10] That part is working fine with my changes to the config on tin [03:34:13] But it gets overwritten [03:34:49] and then it breaks sometime in initialization? [03:34:55] grr, internet lagging [03:35:01] the array merge looks suspicious [03:35:06] which array merge? [03:35:46] } elseif ( is_array( $GLOBALS[$key] ) && is_array( $val ) ) { [03:35:46] $GLOBALS[$key] = array_merge( $val, $GLOBALS[$key] ); [03:36:09] * legoktm live hacks a bit more [03:36:23] php > var_dump( array_merge( array( 1 => true ) ) ); [03:36:23] array(1) { [03:36:23] [0]=> [03:36:23] bool(true) [03:36:24] } [03:37:49] array_merge screws up numeric keys? 
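Editorial note: the `var_dump` output above shows PHP's `array_merge()` turning key `1` into key `0`. The sketch below is a rough Python model of that key handling, not PHP itself. The function names and the sample namespace map are hypothetical and chosen only for illustration. It assumes, as the discussion suggests, that the namespace array is keyed by integer namespace ID. A model of PHP's `+` array operator is included purely for contrast; the log does not say that it was the fix used.

```python
def php_array_merge(*arrays):
    """Rough model of how PHP's array_merge() treats keys (illustration only)."""
    merged = {}
    next_index = 0
    for array in arrays:
        for key, value in array.items():
            if type(key) is int:
                # Integer keys are dropped; values are appended and renumbered from 0.
                merged[next_index] = value
                next_index += 1
            else:
                # String keys are kept; a later array overwrites an earlier one.
                merged[key] = value
    return merged


def php_array_union(left, right):
    """Rough model of PHP's `+` array operator: keys are kept, the left operand wins."""
    result = dict(left)
    for key, value in right.items():
        result.setdefault(key, value)
    return result


# Hypothetical namespace map: IDs 0, 2 and 100 enabled.
namespaces = {0: True, 2: True, 100: True}
print(php_array_merge(namespaces))      # keys become 0, 1, 2
print(php_array_union(namespaces, {}))  # keys stay 0, 2, 100
```

With the hypothetical map, the merged result has keys 0, 1 and 2, so an ID that was never listed (1, a talk namespace) ends up enabled, which is consistent with the symptom reported above.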
[03:37:51] and the array_merge is necessary to fix things like $wgAvailableRights [03:37:55] yes [03:37:58] it's "intentional" [03:37:59] gah [03:39:09] I'm just going to revert for now [03:39:19] that sounds like a good idea [03:40:42] sorry, I should have caught this earlier :/ [03:40:55] so should I [03:41:41] doesn't cherry-pick cleanly :/ [03:47:19] PROBLEM - IPsec on cp1046 is CRITICAL: Strongswan CRITICAL - 10 ESP transports installed, 6 problems (kernel-state-missing: 6) [03:47:40] PROBLEM - IPsec on cp4011 is CRITICAL: Strongswan CRITICAL - 15 ESP transports installed, 1 problems (kernel-state-missing: 1) [03:47:49] PROBLEM - IPsec on cp3017 is CRITICAL: Strongswan CRITICAL - 15 ESP transports installed, 1 problems (kernel-state-missing: 1) [03:48:29] PROBLEM - IPsec on cp3015 is CRITICAL: Strongswan CRITICAL - 15 ESP transports installed, 1 problems (kernel-state-missing: 1) [03:48:29] PROBLEM - IPsec on cp3016 is CRITICAL: Strongswan CRITICAL - 15 ESP transports installed, 1 problems (kernel-state-missing: 1) [03:48:52] !log krenair Synchronized php-1.26wmf16/extensions/VisualEditor: https://gerrit.wikimedia.org/r/#/c/228197/ (duration: 00m 12s) [03:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:49:00] PROBLEM - IPsec on cp4020 is CRITICAL: Strongswan CRITICAL - 15 ESP transports installed, 1 problems (kernel-state-missing: 1) [03:49:00] PROBLEM - IPsec on cp4019 is CRITICAL: Strongswan CRITICAL - 15 ESP transports installed, 1 problems (kernel-state-missing: 1) [03:53:27] (03PS1) 10Alex Monk: Fix part of the VE NS config issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228198 [04:04:00] PROBLEM - Incoming network saturation on labstore1003 is CRITICAL 17.24% of data above the critical threshold [100000000.0] [04:09:24] !log upgrade/restart dbstore1001 [04:09:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:17:10] PROBLEM - check if wikidata.org dispatch 
lag is higher than 2 minutes on wikidata is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - pattern not found - 1423 bytes in 0.129 second response time [04:34:00] RECOVERY - Incoming network saturation on labstore1003 is OK Less than 10.00% above the threshold [75000000.0] [04:41:30] PROBLEM - IPsec on cp4012 is CRITICAL: Strongswan CRITICAL - 15 ESP transports installed, 1 problems (kernel-state-missing: 1) [04:42:50] PROBLEM - IPsec on cp1047 is CRITICAL: Strongswan CRITICAL - 11 ESP transports installed, 5 problems (kernel-state-missing: 5) [04:45:41] !log @tin ResourceLoader cache refresh completed at Fri Jul 31 04:45:41 UTC 2015 (duration 45m 40s) [04:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [04:47:47] (PS4) MZMcBride: Add krinkle to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/227728 (https://phabricator.wikimedia.org/T107243) (owner: RobH) [04:48:10] RECOVERY - check if wikidata.org dispatch lag is higher than 2 minutes on wikidata is OK: HTTP OK: HTTP/1.1 200 OK - 1416 bytes in 0.112 second response time [04:50:54] hahah logmsgbot now includes the hostname? [04:56:25] yup [04:56:40] it should have $USER in there too... hmmm [04:57:00] I would have expected it to say l10nupdate@tin ... [05:12:42] $LOGNAME? 
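Editorial note: the `@tin` entries above have an empty user part. One possible explanation, which the log does not confirm, is that `$USER` is unset in the environment the job runs in. Under that assumption, the sketch below shows one way to build a `user@host` prefix that falls back from `$USER` to `$LOGNAME` and then to the passwd database. The names `current_user` and `log_identity` are hypothetical and are not taken from the actual logging script.

```python
import os
import pwd
import socket


def current_user(env):
    """Return a user name without relying on $USER alone (Unix only)."""
    user = env.get("USER") or env.get("LOGNAME")
    if user:
        return user
    try:
        # Fall back to the passwd entry for the uid the process runs as.
        return pwd.getpwuid(os.getuid()).pw_name
    except KeyError:
        # No passwd entry for this uid: use the numeric uid as a last resort.
        return str(os.getuid())


def log_identity(env=None, hostname=None):
    """Build a "user@host" prefix for a log message."""
    env = os.environ if env is None else env
    host = hostname or socket.gethostname().split(".")[0]
    return f"{current_user(env)}@{host}"


# With neither variable set, the passwd lookup still yields a non-empty user.
print(log_identity({}, "tin"))
```

Passing the environment as a mapping keeps the fallback order easy to check without changing the real process environment.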
[05:13:24] it must have something to do with running from system level cron [05:42:29] RECOVERY - IPsec on cp4020 is OK: Strongswan OK - Security Associations: 16 ESP transports installed [05:43:10] RECOVERY - IPsec on cp4012 is OK: Strongswan OK - Security Associations: 16 ESP transports installed [05:43:34] (03PS1) 10Krinkle: rl-test: Track full XFF, not just the trusted "client" IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228204 [05:43:55] (03CR) 10Krinkle: [C: 032] rl-test: Track full XFF, not just the trusted "client" IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228204 (owner: 10Krinkle) [05:44:00] (03Merged) 10jenkins-bot: rl-test: Track full XFF, not just the trusted "client" IP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228204 (owner: 10Krinkle) [05:47:49] PROBLEM - puppet last run on ganeti2001 is CRITICAL puppet fail [05:50:11] RECOVERY - IPsec on cp1047 is OK: Strongswan OK - Security Associations: 16 ESP transports installed [05:53:40] PROBLEM - RAID on analytics1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:53:51] PROBLEM - SSH on analytics1013 is CRITICAL - Socket timeout after 10 seconds [05:54:20] PROBLEM - configured eth on analytics1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:54:40] PROBLEM - dhclient process on analytics1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:54:50] PROBLEM - Disk space on Hadoop worker on analytics1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:54:59] PROBLEM - Hadoop DataNode on analytics1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:54:59] PROBLEM - puppet last run on analytics1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:54:59] PROBLEM - DPKG on analytics1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:55:00] PROBLEM - salt-minion processes on analytics1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[05:55:49] PROBLEM - Disk space on analytics1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:13:40] RECOVERY - puppet last run on ganeti2001 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:28:49] PROBLEM - NTP on analytics1013 is CRITICAL: NTP CRITICAL: No response from NTP server [06:31:29] PROBLEM - puppet last run on subra is CRITICAL Puppet has 1 failures [06:31:40] PROBLEM - puppet last run on db2056 is CRITICAL Puppet has 1 failures [06:31:50] PROBLEM - puppet last run on mw2043 is CRITICAL Puppet has 1 failures [06:31:51] PROBLEM - puppet last run on mw2129 is CRITICAL Puppet has 1 failures [06:32:20] PROBLEM - puppet last run on mw2045 is CRITICAL Puppet has 1 failures [06:32:30] PROBLEM - puppet last run on lvs1003 is CRITICAL Puppet has 1 failures [06:32:39] PROBLEM - puppet last run on mw1135 is CRITICAL Puppet has 1 failures [06:33:10] PROBLEM - puppet last run on mw1061 is CRITICAL Puppet has 1 failures [06:33:21] PROBLEM - puppet last run on mw2207 is CRITICAL Puppet has 1 failures [06:33:40] PROBLEM - puppet last run on mw2126 is CRITICAL Puppet has 1 failures [06:34:00] PROBLEM - puppet last run on mw2050 is CRITICAL Puppet has 1 failures [06:55:40] RECOVERY - puppet last run on mw2043 is OK Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:56:20] RECOVERY - puppet last run on lvs1003 is OK Puppet is currently enabled, last run 1 second ago with 0 failures [06:57:20] RECOVERY - puppet last run on subra is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [06:57:30] RECOVERY - puppet last run on db2056 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:57:30] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:57:40] RECOVERY - puppet last run on mw2129 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [06:57:49] heh, nice, this means _joe_ is 
on and working now :D [06:57:50] RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:57:58] <_joe_> since about 1 hours [06:58:01] heh [06:58:06] you're early! [06:58:10] RECOVERY - puppet last run on mw2045 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:11] I thought you matched the puppetspam [06:58:30] RECOVERY - puppet last run on mw1135 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:58:39] _joe_ your -2 comments were addressed in https://gerrit.wikimedia.org/r/#/c/227887/ [06:58:51] RECOVERY - puppet last run on mw1061 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:01] <_joe_> YuviPanda: k, will take a look [06:59:13] _joe_ thanks [06:59:19] RECOVERY - puppet last run on mw2207 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:59:40] 6operations, 6Labs: Investigate jessie-backports for labstores - https://phabricator.wikimedia.org/T107507#1497768 (10MoritzMuehlenhoff) If the apt pinning for backports is configured in a way that backports is only selected on a case-by-case basis by running e.g. "apt-get install -t jessie-backports install f... [07:00:25] 6operations, 6Labs: Investigate jessie-backports for labstores - https://phabricator.wikimedia.org/T107507#1497769 (10yuvipanda) So is there a way for us to do -t with puppet? 
[07:25:44] (03PS2) 10Yuvipanda: labstore: Fixup start-nfs for new storage layout [puppet] - 10https://gerrit.wikimedia.org/r/228179 (https://phabricator.wikimedia.org/T106590) [07:25:49] (03CR) 10jenkins-bot: [V: 04-1] labstore: Fixup start-nfs for new storage layout [puppet] - 10https://gerrit.wikimedia.org/r/228179 (https://phabricator.wikimedia.org/T106590) (owner: 10Yuvipanda) [07:26:54] (03PS3) 10Yuvipanda: labstore: Fixup start-nfs for new storage layout [puppet] - 10https://gerrit.wikimedia.org/r/228179 (https://phabricator.wikimedia.org/T106590) [07:35:33] <_joe_> !log powercycling analytics1013, no ssh, console unresponsive [07:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:37:50] RECOVERY - SSH on analytics1013 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [07:38:30] RECOVERY - configured eth on analytics1013 is OK - interfaces up [07:38:40] RECOVERY - dhclient process on analytics1013 is OK: PROCS OK: 0 processes with command name dhclient [07:38:59] RECOVERY - Hadoop DataNode on analytics1013 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [07:39:00] RECOVERY - Disk space on Hadoop worker on analytics1013 is OK: DISK OK [07:39:00] RECOVERY - DPKG on analytics1013 is OK: All packages OK [07:39:00] RECOVERY - puppet last run on analytics1013 is OK Puppet is currently enabled, last run 2 hours ago with 0 failures [07:39:00] RECOVERY - salt-minion processes on analytics1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [07:39:30] RECOVERY - RAID on analytics1013 is OK no disks configured for RAID [07:39:50] RECOVERY - Disk space on analytics1013 is OK: DISK OK [07:42:50] RECOVERY - NTP on analytics1013 is OK: NTP OK: Offset -0.01955008507 secs [07:56:54] 6operations, 7discovery-system: Create a conftool "agent" that overcomes confd deficiencies - https://phabricator.wikimedia.org/T107285#1497830 (10Joe) 
[07:56:56] 6operations, 5Patch-For-Review, 7discovery-system: implement write locking in conftool - https://phabricator.wikimedia.org/T107286#1497829 (10Joe) 5Open>3Resolved [07:58:13] 6operations, 10ops-eqiad, 6Discovery, 10Wikidata, and 2 others: Change hardware RAID controller on wmf3543, wmf3544 - https://phabricator.wikimedia.org/T107152#1497833 (10Joe) @ksmith yes, @Cmjohnson simply didn't update the ticket. I am going to install the servers now. [08:01:54] (03CR) 10Matanya: [C: 031] Remove several dead domains from redirects [puppet] - 10https://gerrit.wikimedia.org/r/225041 (https://phabricator.wikimedia.org/T105981) (owner: 10Glaisher) [08:02:37] Nemo_bis: have you had any session loss errors in the last 12hrs or so? [08:03:35] 6operations, 6Labs: Investigate jessie-backports for labstores - https://phabricator.wikimedia.org/T107507#1497836 (10faidon) The difference between the two labstores is because of a change in upstream d-i during the jessie RC cycle that disabled backports by default, cf. [[ https://bugs.debian.org/764982 | De... 
[08:04:14] 6operations, 6Labs: Investigate whether to use Debian's jessie-backports - https://phabricator.wikimedia.org/T107507#1497837 (10faidon) p:5Triage>3Normal [08:16:50] (03PS1) 10Ori.livneh: Set $wgAjaxEditStash to false, on suspicion of being implicated in T102199 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228211 [08:17:58] (03CR) 10Ori.livneh: [C: 032] "It'd also be nice to reconfirm that this has a positive impact, even after several recent rounds of performance work on the page saving pa" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228211 (owner: 10Ori.livneh) [08:18:04] (03Merged) 10jenkins-bot: Set $wgAjaxEditStash to false, on suspicion of being implicated in T102199 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228211 (owner: 10Ori.livneh) [08:19:18] !log ori Synchronized wmf-config/CommonSettings.php: I7be6dd2f5: Set $wgAjaxEditStash to false, on suspicion of being implicated in T102199 (duration: 00m 12s) [08:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:30:18] (03PS1) 10Giuseppe Lavagetto: install-server: remove raid from wdqs [puppet] - 10https://gerrit.wikimedia.org/r/228213 [08:33:09] (03CR) 10Giuseppe Lavagetto: [C: 032] install-server: remove raid from wdqs [puppet] - 10https://gerrit.wikimedia.org/r/228213 (owner: 10Giuseppe Lavagetto) [08:33:51] 6operations, 7discovery-system: Create a conftool "agent" that overcomes confd deficiencies - https://phabricator.wikimedia.org/T107285#1497869 (10faidon) I'd be very cautious about implementing a system that essentially... sounds like a configuration management system (including templating, post-hook actions,... 
[08:36:36] ori: I've not really been editing [08:36:54] don't make me engage you [08:37:38] (03CR) 10Filippo Giunchedi: diamond: add upstart/systemd service stats (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/224093 (owner: 10Filippo Giunchedi) [08:37:57] Nemo_bis: the logs all exclusively for session_fail_preview, do you remember if you were seeing it for save attempts or just previews? [08:38:12] 6operations, 7discovery-system: Create a conftool "agent" that overcomes confd deficiencies - https://phabricator.wikimedia.org/T107285#1497875 (10Joe) Pybal is the first suspect, but we need something that can change file state reliably upon etcd state change fleet-wide. Another example would be haproxy conf... [08:38:12] also, are you using some preview-as-you-type gadget? [08:39:19] 6operations, 7discovery-system: Create a conftool "agent" that overcomes confd deficiencies - https://phabricator.wikimedia.org/T107285#1497884 (10ori) So should I not bother with https://gerrit.wikimedia.org/r/#/c/225649/ ? [08:40:14] ori: I got one on save today. No special gadget, just live preview. [08:40:23] (03PS1) 10Ori.livneh: Revert "Set $wgAjaxEditStash to false, on suspicion of being implicated in T102199" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228216 [08:42:08] <_joe_> ori: no, I think you should. If we integrate etcd directly into pybal, this is less immediately needed. We still need it for other things in the future where integration would not be immediate [08:42:32] (03CR) 10Ori.livneh: [C: 032] "Good news: it works well and has a substantial impact. 
Bad news: not related to T102199" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228216 (owner: 10Ori.livneh) [08:42:37] (03Merged) 10jenkins-bot: Revert "Set $wgAjaxEditStash to false, on suspicion of being implicated in T102199" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228216 (owner: 10Ori.livneh) [08:42:45] holy shit [08:42:48] there is a horny cat outside [08:42:51] i am going to fucking murder it [08:42:55] <_joe_> lol [08:43:52] <_joe_> ori, I had a lonely dog barking last night, an horny cat is the only worse thing I can imagine [08:44:25] 6operations, 7discovery-system: Create a conftool "agent" that overcomes confd deficiencies - https://phabricator.wikimedia.org/T107285#1497896 (10Joe) @ori not at all, if we do integrate pybal directly this is just a lower priority task. [08:44:38] haha, one of my neighbors is outside looking for it [08:44:43] looking very angry [08:45:06] <_joe_> ori: btw, https://gerrit.wikimedia.org/r/#/c/225649/ is in a reviewable state? [08:45:26] <_joe_> may I add things to it/ build other patches upon it? [08:46:49] if you remove the 'from pybal import USER_AGENT_STRING' and 'agent = USER_AGENT_STRING' lines, you can invoke it directly and it'll watch the conftool etcd keyspace for changes [08:46:57] but it's not hooked up to the rest of pybal yet [08:47:12] reviews / feedback / changes / additions / etc. 
more than welcome tho [08:47:48] (03PS4) 10Filippo Giunchedi: diamond: service stats puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/224094 [08:47:50] (03PS4) 10Filippo Giunchedi: diamond: add upstart/systemd service stats [puppet] - 10https://gerrit.wikimedia.org/r/224093 [08:47:50] <_joe_> ori: ok, I think we need to add a few things there [08:48:15] (03PS1) 10Muehlenhoff: Add a separate Hiera source for dumps mirrors only reachable via IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/228217 (https://phabricator.wikimedia.org/T104991) [08:52:56] _joe_: go nuts, i'm not territorial about it at all [08:53:03] feel free to update the patch etc [08:53:34] or to adapt it into another patch altogether [08:54:28] <_joe_> ori: ok [08:55:14] the vague idea i had is that we have an abstract Configuration interface [08:55:36] the plan has always been to have pybal talk to etcd directly [08:55:58] I'm not sure why we're still talking about a separate agent for this use case [08:55:59] that has a filesystem- / inotify-based notification implementation, an etcd implementation, and a simple http client [08:56:35] and pybal would delegate to the right implementation based on the scheme of the configuration URI [08:57:04] (03CR) 10Filippo Giunchedi: Add a role to run a debdeploy master (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/227682 (owner: 10Muehlenhoff) [08:57:06] so: config = http://config-master.eqiad.wmnet/pybal/eqiad/zotero [08:57:23] or config = etcd://etcd1001.eqiad.wmnet/v2/pybal/eqiad/zotero [08:57:30] as for the other use cases (varnish, haproxy etc.), I can see the argument, but these erb templates generating confd templates are grossing me out [08:57:41] or config = file:///etc/pybal/eqiad/zotero [08:58:23] and talks about an even newer system that does this and is even less dumb than conftool feel like they are going in the wrong direction (at least to me) [08:58:57] <_joe_> paravoid: ok, just a sec. 
For pybal, I agree I'd like it to have embedded support [08:59:14] if the concern is that puppet is too slow, maybe we should invest some time in addressing this (e.g. partial puppet runs had been proposed before) before we go and implement a new system from scratch [08:59:38] <_joe_> ori: for etcd, the notable difference is it should react to remote changes triggering a config run [08:59:56] what do you mean? [09:00:20] ori: I see your neighbourhood never sleeps [09:00:27] <_joe_> you change something in etcd => the watcher in pybal notices => it modifies the config just for the bit that has changed [09:00:30] i imagined that it would react to changes in the configuration as represented in etcd the way it would to changes in the configuration as represented as files [09:00:44] ori: no, I don't use such a gadget [09:00:59] _joe_: i think that's a bit off. why have pybal update the config files? [09:01:09] <_joe_> sorry not the config files [09:01:16] <_joe_> its internal state [09:01:35] right [09:01:56] that's not fundamentally different from setting an inotify watch on a config file and then reloading it (and acting on changes) when the file is touched [09:02:05] <_joe_> yes [09:02:16] on an only tangentially related point, pybal's current configuration mechanism needs to go [09:02:28] <_joe_> the eval(), right? [09:02:30] yes [09:02:35] oh no [09:02:38] there's an eval? [09:02:39] a simple http client reading e.g. json is fine [09:02:42] <_joe_> yes [09:02:46] ..there's an eval.. [09:02:51] ori: the configs are not json, they're python code [09:02:56] <_joe_> paravoid: that was my idea too [09:03:12] I joked at some point that we could implement etcd support by writing python in the config file [09:03:22] <_joe_> paravoid: ahah [09:03:27] not even ast.literal_eval [09:03:30] <_joe_> that would be awesome [09:03:44] metacircular load balancer [09:03:47] pybal in pybal [09:04:05] <_joe_> we can hot-patch pybal via config files. 
Take that erlang [09:05:17] so yeah, writing an agent that connects to etcd, reads some jinja2 templates, parses them and spews out python code to disk, so that the kernel can generate an event via inotify that would trigger pybal to read those files and eval() them... [09:05:26] when you see this all written down it sounds pretty crazy doesn't it? [09:05:36] so let's not do this pretty please? :) [09:05:40] so to me the fact that the current configuration format is suspect is yet another reason to try very hard to have the layout and structure of configuration data in the various configuration backends we imagine pybal supporting (files, http, etcd) as similar as possible [09:06:32] the only thought i had was whether pybal should talk back to etcd [09:06:45] i.e., whether each host should have a target state and a current state [09:06:59] yeah, eventually we'll need this [09:07:35] (03CR) 10Muehlenhoff: Add a role to run a debdeploy master (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/227682 (owner: 10Muehlenhoff) [09:07:54] I'm not sure if "target" and "current" are good names for that [09:08:05] as you also have the admin/runtime dimension as well [09:08:07] (03PS2) 10Muehlenhoff: Add a role to run a debdeploy master [puppet] - 10https://gerrit.wikimedia.org/r/227682 [09:08:35] (configured to be pooled but is down) [09:08:49] but that's besides your point I guess :) [09:08:59] how does systemd call them? [09:09:31] ? [09:09:32] systemd? [09:09:50] yeah, systemd (and upstart) have a similar notion [09:10:14] of how a service is in fact vs. what state the init daemon is trying to get it to [09:10:53] it's not exactly the same thing but maybe the terminology is useful, dunno :P [09:11:49] * ori sleeps [09:11:54] bye :) [09:11:59] <_joe_> bye :) [09:12:12] * paravoid eats [09:13:46] 6operations, 10ops-eqiad: db1059 raid degraded - https://phabricator.wikimedia.org/T107024#1497942 (10jcrespo) @cmjohnson I think this is one of the ones covered by warranty. 
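The pybal configuration idea discussed above (one abstract configuration interface, with the backend chosen from the scheme of the `config = ...` URI, and no `eval()` on the data) might look roughly like the sketch below. This is an illustration only, not pybal's actual code: the class names, `source_for`, and `parse_server_line` are hypothetical, the backends are empty placeholders that do no I/O, and the one-dict-literal-per-line format is an assumption about what the existing config files contain.

```python
import ast
from urllib.parse import urlsplit


class ConfigSource:
    """Hypothetical common interface for every configuration backend."""

    def __init__(self, url):
        self.url = url
        self.parts = urlsplit(url)


class FileSource(ConfigSource):
    """Placeholder: would watch a local file (e.g. via inotify)."""


class HttpSource(ConfigSource):
    """Placeholder: would poll a plain HTTP endpoint."""


class EtcdSource(ConfigSource):
    """Placeholder: would watch an etcd keyspace for changes."""


# Scheme of the configuration URI -> backend class (assumed mapping).
SOURCES = {"file": FileSource, "http": HttpSource, "etcd": EtcdSource}


def source_for(url):
    """Pick the backend from the URI scheme, as proposed in the chat."""
    scheme = urlsplit(url).scheme
    if scheme not in SOURCES:
        raise ValueError("unsupported configuration scheme: %r" % scheme)
    return SOURCES[scheme](url)


def parse_server_line(line):
    """Parse one dict literal without eval().

    ast.literal_eval only accepts Python literals, so a line containing
    a function call raises ValueError instead of being executed.
    """
    value = ast.literal_eval(line.strip())
    if not isinstance(value, dict):
        raise ValueError("expected a dict literal, got %s" % type(value).__name__)
    return value


if __name__ == "__main__":
    for url in (
        "http://config-master.eqiad.wmnet/pybal/eqiad/zotero",
        "etcd://etcd1001.eqiad.wmnet/v2/pybal/eqiad/zotero",
        "file:///etc/pybal/eqiad/zotero",
    ):
        src = source_for(url)
        print(type(src).__name__, src.parts.path)
```

Keeping the dispatch in a plain dictionary means the three backends only have to agree on the shape of the data they hand back, which is the property argued for in the chat: the layout of the configuration should be as similar as possible across files, http and etcd.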
[09:15:55] * _joe_ battles with partman [09:16:10] (03Abandoned) 10Muehlenhoff: Add ferm rules for rcstream [puppet] - 10https://gerrit.wikimedia.org/r/227218 (https://phabricator.wikimedia.org/T104981) (owner: 10Muehlenhoff) [09:32:33] _joe_: d-i-test VM has come handy before to test partman, it has just two disks ATM though [09:32:38] x265 to reprepro? [09:32:40] interesting [09:32:49] ffmpeg backport I guess [09:32:53] moritzm: that you? :) [09:33:08] <_joe_> godog: yeah I didn't think I needed it, sadly I might [09:33:16] <_joe_> paravoid: I guess godog? [09:34:48] yeah that's me, though later I've removed them as we don't need them for jessie ATM but for trusty (T103335) [09:35:07] and we'd be better off with an official jessie-backports anyway [09:35:44] no, that's filippo [09:38:47] 6operations, 10hardware-requests: eqiad: 1 hardware access request for labs on real hardware - https://phabricator.wikimedia.org/T106731#1498018 (10faidon) Could you elaborate what is it that you need? I don't see it being very networking-intensive as a task, so I think we might able to fit it in, but I may be... [09:46:46] (03CR) 10Glaisher: [C: 031] Enable VisualEditor on NS_PROJECT for meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228041 (https://phabricator.wikimedia.org/T107003) (owner: 10Jforrester) [10:12:33] (03PS3) 10Glaisher: Remove several dead domains from redirects [puppet] - 10https://gerrit.wikimedia.org/r/225041 (https://phabricator.wikimedia.org/T105981) [10:13:47] (03CR) 10Glaisher: "rebased" [puppet] - 10https://gerrit.wikimedia.org/r/225041 (https://phabricator.wikimedia.org/T105981) (owner: 10Glaisher) [10:17:20] paravoid: have a few minutes to give input in a task? (lists related) [10:17:31] yes! :) [10:17:50] https://phabricator.wikimedia.org/T90407 - not so exciting but a nice debate I guess :) [10:19:01] ohdear :) [10:21:08] (03CR) 10John F. 
Lewis: [C: 031] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/225041 (https://phabricator.wikimedia.org/T105981) (owner: 10Glaisher) [10:22:40] (03PS1) 10Jcrespo: returning db1035 to 100% load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228223 [10:27:39] (03CR) 10Jcrespo: [C: 032] returning db1035 to 100% load [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228223 (owner: 10Jcrespo) [10:30:14] !log jynus Synchronized wmf-config/db-eqiad.php: returning db1035 to 100% load (duration: 00m 12s) [10:30:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:30:27] 6operations, 10Gather, 10MobileFrontend, 7HHVM, and 2 others: [facebook/hhvm] Incorrect return value from eval, Closure generated in first eval pass is returned in the second eval pass #5502 - https://phabricator.wikimedia.org/T102937#1498082 (10Jhernandez) Thanks a lot folks! [10:36:08] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, and 3 others: Wikidata Query Service hardware - https://phabricator.wikimedia.org/T86561#1498099 (10Joe) I am currently installing wdqs1001; upon validation of the install, I'll add wdqs1002 as well. [10:49:30] 6operations, 10Continuous-Integration-Infrastructure, 6Multimedia, 5Patch-For-Review: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#1498115 (10fgiunchedi) >>! In T103335#1481198, @MoritzMuehlenhoff wrote: >>>! In T103335#1480470,... 
[10:51:51] (03PS1) 10Muehlenhoff: Create a common ferm base class for the database hosts and move the existing labsdb slave definition over to it [puppet] - 10https://gerrit.wikimedia.org/r/228228 (https://phabricator.wikimedia.org/T104699) [10:55:23] 6operations, 10Continuous-Integration-Infrastructure, 6Multimedia, 5Patch-For-Review: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#1498117 (10MoritzMuehlenhoff) > I gave this a try, so far we already backported `x265` and `shine`... [10:57:26] 6operations, 10Continuous-Integration-Infrastructure, 6Multimedia, 5Patch-For-Review: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#1498118 (10MoritzMuehlenhoff) We could also disable shine, that's only relevant for embedded hardw... [11:08:16] (03CR) 10Jcrespo: [C: 04-1] "Let's add db1011.eqiad.wmnet to the monitoring, too. db1011 requires client connections to all 3306 db hosts." [puppet] - 10https://gerrit.wikimedia.org/r/228228 (https://phabricator.wikimedia.org/T104699) (owner: 10Muehlenhoff) [11:11:40] RECOVERY - IPsec on cp3015 is OK: Strongswan OK - Security Associations: 16 ESP transports installed [11:11:40] RECOVERY - IPsec on cp3016 is OK: Strongswan OK - Security Associations: 16 ESP transports installed [11:12:11] RECOVERY - IPsec on cp4019 is OK: Strongswan OK - Security Associations: 16 ESP transports installed [11:12:46] (03CR) 10Jcrespo: "Disregard my previous comment, it is already included on the internal. But lets include iron on both 3306 and 3307, the externa ip." 
[puppet] - 10https://gerrit.wikimedia.org/r/228228 (https://phabricator.wikimedia.org/T104699) (owner: 10Muehlenhoff) [11:12:49] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors [11:12:50] RECOVERY - IPsec on cp4011 is OK: Strongswan OK - Security Associations: 16 ESP transports installed [11:13:00] RECOVERY - IPsec on cp3017 is OK: Strongswan OK - Security Associations: 16 ESP transports installed [11:13:01] RECOVERY - IPsec on cp1046 is OK: Strongswan OK - Security Associations: 16 ESP transports installed [11:14:52] (03PS1) 10Giuseppe Lavagetto: wdqs: declare system user and group [puppet] - 10https://gerrit.wikimedia.org/r/228231 [11:14:54] (03PS1) 10Giuseppe Lavagetto: wdqs: add monitoring group [puppet] - 10https://gerrit.wikimedia.org/r/228232 [11:15:01] <_joe_> I'll be back later, now lunch [11:15:16] <_joe_> the icinga config issue is due to wdqs, btw, will fix it when I'm back [11:15:44] (03CR) 10jenkins-bot: [V: 04-1] wdqs: add monitoring group [puppet] - 10https://gerrit.wikimedia.org/r/228232 (owner: 10Giuseppe Lavagetto) [11:21:59] PROBLEM - puppet last run on mw1253 is CRITICAL Puppet has 1 failures [11:26:40] (03CR) 10Faidon Liambotis: [C: 04-2] "It's not doing anything right now, but it works fine, why would we remove it or decom' it?" [puppet] - 10https://gerrit.wikimedia.org/r/227997 (owner: 10Dzahn) [11:28:25] paravoid: don't we usually remove unused hosts from site.pp though? Even if they're fine? 
[11:28:41] I don't really see the point in at least this case [11:30:22] (03PS1) 10Faidon Liambotis: (WIP) Switch GeoIP2, adds proper IPv6 support [dns] - 10https://gerrit.wikimedia.org/r/228233 [11:30:52] bblack: ^ [11:30:57] bblack: it got V+2 too :) [11:31:06] bblack: but 2/3 NSes are not upgraded yet, will do so next week [11:32:37] (03PS2) 10Faidon Liambotis: (WIP) Switch GeoIP2, adds proper IPv6 support [dns] - 10https://gerrit.wikimedia.org/r/228233 [11:34:59] paravoid: :) [11:35:36] btw I don't think we actually use only-primary-map, can probably just delete it [11:35:49] (03PS1) 10Faidon Liambotis: authdns: remove chroot support from authdns-lint [puppet] - 10https://gerrit.wikimedia.org/r/228235 [11:35:52] unless that was to have a quick easy reconfig in some failure scenario I guess [11:36:11] ^^^ [11:37:00] (03CR) 10BBlack: [C: 031] authdns: remove chroot support from authdns-lint [puppet] - 10https://gerrit.wikimedia.org/r/228235 (owner: 10Faidon Liambotis) [11:37:12] iirc it was to be able to switch e.g. 
bits or upload back to eqiad but leaving the rest as they are [11:37:19] (03CR) 10BBlack: [C: 031] (WIP) Switch GeoIP2, adds proper IPv6 support [dns] - 10https://gerrit.wikimedia.org/r/228233 (owner: 10Faidon Liambotis) [11:37:36] but better ways to do this these days [11:37:37] we could do similar with the admin_state thing now anyways [11:38:35] (03PS1) 10Faidon Liambotis: Kill only-primary-map, unused and redundant [dns] - 10https://gerrit.wikimedia.org/r/228236 [11:39:05] (03CR) 10BBlack: [C: 031] Kill only-primary-map, unused and redundant [dns] - 10https://gerrit.wikimedia.org/r/228236 (owner: 10Faidon Liambotis) [11:39:53] (03PS3) 10Faidon Liambotis: (WIP) Switch to GeoIP2, adds proper IPv6 support [dns] - 10https://gerrit.wikimedia.org/r/228233 [11:40:13] (03PS2) 10Faidon Liambotis: authdns: remove chroot support from authdns-lint [puppet] - 10https://gerrit.wikimedia.org/r/228235 [11:40:20] (03CR) 10Faidon Liambotis: [C: 032 V: 032] authdns: remove chroot support from authdns-lint [puppet] - 10https://gerrit.wikimedia.org/r/228235 (owner: 10Faidon Liambotis) [11:41:03] (03PS2) 10Muehlenhoff: Create a common ferm base class for the database hosts and move the existing labsdb slave definition over to it [puppet] - 10https://gerrit.wikimedia.org/r/228228 (https://phabricator.wikimedia.org/T104699) [11:43:31] RECOVERY - Cassanda CQL query interface on restbase1008 is OK: TCP OK - 0.003 second response time on port 9042 [11:45:30] 6operations, 10Traffic, 5Patch-For-Review: Upgrade prod DNS daemons to gdnsd 2.2.0 - https://phabricator.wikimedia.org/T98003#1498193 (10faidon) gdnsd 2.2.0 packages were prepared and landed in Debian unstable. libmaxminddb & gdnsd 2.2.0 backports are now in jessie-wikimedia. integration-lightslave-jessie-1... 
[11:46:39] RECOVERY - puppet last run on mw1253 is OK Puppet is currently enabled, last run 24 seconds ago with 0 failures [11:53:41] (03PS1) 10Muehlenhoff: Enable ferm rules for role::mariadb::dbstore [puppet] - 10https://gerrit.wikimedia.org/r/228237 (https://phabricator.wikimedia.org/T104699) [12:02:14] (03Abandoned) 10Muehlenhoff: Add ferm rules for dbstore systems [puppet] - 10https://gerrit.wikimedia.org/r/226267 (https://phabricator.wikimedia.org/T104699) (owner: 10Muehlenhoff) [12:07:03] (03PS1) 10Muehlenhoff: Add ferm rules for role::mariadb::proxy [puppet] - 10https://gerrit.wikimedia.org/r/228239 (https://phabricator.wikimedia.org/T104699) [12:07:57] (03Abandoned) 10Muehlenhoff: Add ferm rules for dbproxy hosts [puppet] - 10https://gerrit.wikimedia.org/r/225851 (https://phabricator.wikimedia.org/T104699) (owner: 10Muehlenhoff) [12:10:22] !log restbase1008 bootstrap finished successfully [12:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:10:53] (03PS2) 10Muehlenhoff: Add base role for debdeploy clients [puppet] - 10https://gerrit.wikimedia.org/r/227683 [12:15:42] (03PS1) 10BBlack: global $::site_tier to replace varnish-level cluster_tier [puppet] - 10https://gerrit.wikimedia.org/r/228240 [12:15:44] (03PS1) 10BBlack: ipsec: no mobike, and split dpdaction on $site_tier [puppet] - 10https://gerrit.wikimedia.org/r/228241 [12:18:03] (03CR) 10Jcrespo: "I'm ok with this as the base, now let's not forget to apply it to labs::db::master and the future labs::db::slave." 
[puppet] - 10https://gerrit.wikimedia.org/r/228228 (https://phabricator.wikimedia.org/T104699) (owner: 10Muehlenhoff) [12:21:33] (03CR) 10Muehlenhoff: "Yes, will trickle into the other sub classes in followup commits (as already done for dbstore and proxy), then we can also check on a case" [puppet] - 10https://gerrit.wikimedia.org/r/228228 (https://phabricator.wikimedia.org/T104699) (owner: 10Muehlenhoff) [12:36:30] (03PS2) 10Giuseppe Lavagetto: wdqs: add monitoring group [puppet] - 10https://gerrit.wikimedia.org/r/228232 [12:36:53] (03PS2) 10Giuseppe Lavagetto: wdqs: declare system user and group [puppet] - 10https://gerrit.wikimedia.org/r/228231 [12:37:25] (03CR) 10Giuseppe Lavagetto: [C: 032] wdqs: declare system user and group [puppet] - 10https://gerrit.wikimedia.org/r/228231 (owner: 10Giuseppe Lavagetto) [12:44:34] (03PS3) 10Giuseppe Lavagetto: wdqs: add monitoring group [puppet] - 10https://gerrit.wikimedia.org/r/228232 [12:44:54] (03CR) 10Giuseppe Lavagetto: [C: 032] wdqs: add monitoring group [puppet] - 10https://gerrit.wikimedia.org/r/228232 (owner: 10Giuseppe Lavagetto) [12:52:07] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, and 3 others: Wikidata Query Service hardware - https://phabricator.wikimedia.org/T86561#1498241 (10Joe) [12:52:10] 6operations, 10ops-eqiad, 6Discovery, 10Wikidata, and 2 others: Change hardware RAID controller on wmf3543, wmf3544 - https://phabricator.wikimedia.org/T107152#1498240 (10Joe) 5Open>3Resolved [12:52:54] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [12:53:55] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, and 3 others: Wikidata Query Service hardware - https://phabricator.wikimedia.org/T86561#971449 (10Joe) wdqs1001 installed just fine (after I figured out I needed at least one deploy for trebuchet to work). Installing wdqs1002 now, then I'll s... 
[13:05:55] 6operations, 6Labs: Investigate whether to use Debian's jessie-backports - https://phabricator.wikimedia.org/T107507#1498261 (10coren) >>! In T107507#1497836, @faidon wrote: > The difference between the two labstores is because of a change in upstream d-i during the jessie RC cycle that disabled backports by d... [13:14:31] (03PS2) 10BBlack: global $::site_tier to replace varnish-level cluster_tier [puppet] - 10https://gerrit.wikimedia.org/r/228240 [13:15:00] (03CR) 10BBlack: [C: 032 V: 032] "Verily, it is Very Verified!" [puppet] - 10https://gerrit.wikimedia.org/r/228240 (owner: 10BBlack) [13:20:14] hashar: Can you have a look at https://integration.wikimedia.org/ci/job/mwext-testextension-zend/5623/console the MathSearch tests fail due to the new test in the Math extension [13:21:16] (03PS2) 10BBlack: ipsec: no mobike, and split dpdaction on $site_tier [puppet] - 10https://gerrit.wikimedia.org/r/228241 [13:21:53] physikerwelt____: then the new test has an issue ? [13:21:54] Fatal error: Cannot redeclare class MathUtilsTest in /mnt/jenkins-workspace/workspace/mwext-testextension-zend/src/extensions/Math/tests/MathUtilsTest.php on line 102 [13:21:55] (03CR) 10BBlack: [C: 032 V: 032] ipsec: no mobike, and split dpdaction on $site_tier [puppet] - 10https://gerrit.wikimedia.org/r/228241 (owner: 10BBlack) [13:21:55] :D [13:22:13] physikerwelt____: you probably want to split https://gerrit.wikimedia.org/r/#/c/196886/ in several changes [13:22:54] RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 7446.00161831 [13:23:14] physikerwelt____: that is probably related to the introduction of extension.json [13:24:55] hashar: look at https://gerrit.wikimedia.org/r/#/c/228245/ [13:29:04] hashar: but it does not seem to be a problem with the math extension itself https://gerrit.wikimedia.org/r/#/c/228246/ [13:29:09] 6operations, 6Labs, 3Labs-Sprint-102, 3Labs-Sprint-103, and 4 
others: labstore has multiple unpuppetized files/scripts/configs - https://phabricator.wikimedia.org/T102478#1498283 (10coren) >>! In T102478#1494562, @faidon wrote: > Sounds good to me, although I'd really prefer it if we reinstalled labstore10... [13:30:40] (03CR) 10Hashar: [C: 032] Send $USER and $HOSTNAME with !log messages [tools/scap] - 10https://gerrit.wikimedia.org/r/228131 (https://phabricator.wikimedia.org/T106460) (owner: 10BryanDavis) [13:31:03] (03Merged) 10jenkins-bot: Send $USER and $HOSTNAME with !log messages [tools/scap] - 10https://gerrit.wikimedia.org/r/228131 (https://phabricator.wikimedia.org/T106460) (owner: 10BryanDavis) [13:31:10] 6operations, 6Labs, 3Labs-Sprint-102, 3Labs-Sprint-103, and 4 others: Reinstall labstore1001 and make sure everything is puppet-ready - https://phabricator.wikimedia.org/T107574#1498287 (10coren) 3NEW a:3coren [13:45:26] 6operations, 10Continuous-Integration-Infrastructure, 6Multimedia, 5Patch-For-Review: Investigate impact of switching from ffmpeg to libav (ffmpeg is not in Jessie) - https://phabricator.wikimedia.org/T103335#1498323 (10fgiunchedi) >>! In T103335#1498117, @MoritzMuehlenhoff wrote: >> I gave this a try, so... [13:45:59] 6operations, 6Labs: Investigate whether to use Debian's jessie-backports - https://phabricator.wikimedia.org/T107507#1498324 (10scfc) (I'd prefer the pinning approach because it's easier to read and I intend to use it in the future for #Tool-Labs execution instances, but:) If backports is only needed for `pyt... 
[13:47:01] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "A few nitpicks but LGTM in general" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/227887 (owner: 10coren) [13:50:54] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia: Support VP9 in TMH (Unable to decode) - https://phabricator.wikimedia.org/T55863#1498329 (10fgiunchedi) with `ffmpeg2theora` 0.29 from debian and an updated `ffmpeg` the video stream is recognized but I'm getting a segfault: ``` filippo@fil... [13:51:18] 6operations, 7HTTPS: download.wikipedia.org is using an invalid certificate - https://phabricator.wikimedia.org/T107575#1498330 (10Chmarkine) [13:58:11] 6operations, 7HTTPS: download.wikipedia.org is using an invalid certificate - https://phabricator.wikimedia.org/T107575#1498351 (10BBlack) dumps and download both sound like they should be moved to misc-web, IMHO. [14:02:24] (03PS1) 10Ottomata: Allow labstore1003 to rsync from stat servers [puppet] - 10https://gerrit.wikimedia.org/r/228251 (https://phabricator.wikimedia.org/T107576) [14:02:30] (03CR) 10coren: nrpe: add new checks for systemd unit health (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/227887 (owner: 10coren) [14:02:43] (03PS5) 10coren: nrpe: add new checks for systemd unit health [puppet] - 10https://gerrit.wikimedia.org/r/227887 [14:03:31] (03CR) 10Filippo Giunchedi: [C: 031] Add base role for debdeploy clients [puppet] - 10https://gerrit.wikimedia.org/r/227683 (owner: 10Muehlenhoff) [14:03:36] (03CR) 10Filippo Giunchedi: [C: 031] Add a role to run a debdeploy master [puppet] - 10https://gerrit.wikimedia.org/r/227682 (owner: 10Muehlenhoff) [14:05:21] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia: Support VP9 in TMH (Unable to decode) - https://phabricator.wikimedia.org/T55863#1498365 (10MoritzMuehlenhoff) The same segfault happens on when converting Snowdonia_by_drone.webm in Debian unstable, so this isn't limited to the backport, b... 
[14:05:41] 6operations, 10Analytics, 6Security: Purge > 90 days stat1002:/a/squid/archive/glam_nara - https://phabricator.wikimedia.org/T92340#1498366 (10Ottomata) Q: does the mediacounts dataset not cover these needs? http://dumps.wikimedia.org/other/mediacounts/README.txt [14:06:50] 6operations, 10Analytics: Upgrade stat1001 to Debian Jessie - https://phabricator.wikimedia.org/T76348#1498372 (10Ottomata) [14:09:03] 6operations, 10Analytics-Cluster, 5Interdatacenter-IPsec: Secure inter-datacenter web request log (Kafka) traffic - https://phabricator.wikimedia.org/T92602#1498385 (10Ottomata) [14:09:06] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 5Patch-For-Review: Build new latest stable (0.8.2.1?) Kafka package and upgrade Kafka brokers - https://phabricator.wikimedia.org/T106581#1472141 (10Ottomata) [14:09:35] !log restart cassandra on restbase1004 to apply java downgrade, missed from batch downgrade yesterday [14:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:10:21] (03PS6) 10coren: nrpe: add new checks for systemd unit health [puppet] - 10https://gerrit.wikimedia.org/r/227887 [14:10:37] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 5Patch-For-Review: Build new latest stable (0.8.2.1?) Kafka package and upgrade Kafka brokers - https://phabricator.wikimedia.org/T106581#1472141 (10Ottomata) [14:10:40] 6operations, 10Analytics-Cluster, 5Interdatacenter-IPsec: Secure inter-datacenter web request log (Kafka) traffic - https://phabricator.wikimedia.org/T92602#1498387 (10Ottomata) [14:11:22] (03CR) 10coren: [C: 032] "Guiseppe's review is always good enough. 
:-)" [puppet] - 10https://gerrit.wikimedia.org/r/227887 (owner: 10coren) [14:13:33] (03PS3) 10Muehlenhoff: Add a role to run a debdeploy master [puppet] - 10https://gerrit.wikimedia.org/r/227682 [14:13:48] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add a role to run a debdeploy master [puppet] - 10https://gerrit.wikimedia.org/r/227682 (owner: 10Muehlenhoff) [14:15:14] (03PS3) 10Muehlenhoff: Add base role for debdeploy clients [puppet] - 10https://gerrit.wikimedia.org/r/227683 [14:15:35] (03CR) 10Muehlenhoff: [C: 032 V: 032] Add base role for debdeploy clients [puppet] - 10https://gerrit.wikimedia.org/r/227683 (owner: 10Muehlenhoff) [14:17:55] (03PS1) 10coren: nrpe: Put the defines in the respectively named files [puppet] - 10https://gerrit.wikimedia.org/r/228258 [14:18:14] PROBLEM - puppet last run on cp3018 is CRITICAL puppet fail [14:18:15] PROBLEM - puppet last run on cp2023 is CRITICAL puppet fail [14:18:25] PROBLEM - puppet last run on cp2026 is CRITICAL puppet fail [14:18:34] YuviPanda: Quick fix for brain fart ^^ I split the defines into two files as needed - and put each in the wrong one. [14:19:44] PROBLEM - puppet last run on cp3042 is CRITICAL puppet fail [14:19:44] PROBLEM - puppet last run on cp3040 is CRITICAL puppet fail [14:19:44] PROBLEM - puppet last run on cp3010 is CRITICAL puppet fail [14:19:54] PROBLEM - puppet last run on cp3047 is CRITICAL puppet fail [14:20:09] Error 400 on SERVER: Puppet::Parser::AST::Resource failed with error ArgumentError: Invalid resource type nrpe::monitor_systemd_unit_state at /etc/puppet/modules/confd/manifests/init.pp:85 on node cp2023.codfw.wmnet [14:20:15] PROBLEM - puppet last run on cp2016 is CRITICAL puppet fail [14:20:16] is the cache fails starting to come in [14:20:21] (03PS2) 10coren: nrpe: Put the defines in the respectively named files [puppet] - 10https://gerrit.wikimedia.org/r/228258 [14:20:33] That puppet fail burst is mine. Fixing one. 
[14:20:35] PROBLEM - puppet last run on cp2014 is CRITICAL puppet fail [14:20:44] PROBLEM - puppet last run on cp1066 is CRITICAL puppet fail [14:20:54] PROBLEM - puppet last run on cp2011 is CRITICAL puppet fail [14:21:14] PROBLEM - puppet last run on cp2010 is CRITICAL puppet fail [14:21:15] PROBLEM - puppet last run on cp2003 is CRITICAL puppet fail [14:21:26] (03CR) 10coren: [C: 032] "Git doesn't make the change clear and makes complex diffs - the fix simply swaps the contents of both files so that the right define is in" [puppet] - 10https://gerrit.wikimedia.org/r/228258 (owner: 10coren) [14:21:54] PROBLEM - puppet last run on cp1071 is CRITICAL puppet fail [14:21:55] PROBLEM - puppet last run on cp4014 is CRITICAL puppet fail [14:21:55] PROBLEM - puppet last run on cp4015 is CRITICAL puppet fail [14:22:15] PROBLEM - puppet last run on cp4019 is CRITICAL puppet fail [14:22:25] PROBLEM - puppet last run on cp2022 is CRITICAL puppet fail [14:22:26] I'm not sure the content swap there has much to do with the puppet parser failure [14:22:35] PROBLEM - puppet last run on cp3006 is CRITICAL puppet fail [14:22:35] PROBLEM - puppet last run on cp1099 is CRITICAL puppet fail [14:22:36] PROBLEM - puppet last run on cp3030 is CRITICAL puppet fail [14:22:45] PROBLEM - puppet last run on cp1058 is CRITICAL puppet fail [14:22:46] PROBLEM - puppet last run on cp1065 is CRITICAL puppet fail [14:22:55] PROBLEM - puppet last run on cp1062 is CRITICAL puppet fail [14:23:15] PROBLEM - puppet last run on cp1073 is CRITICAL puppet fail [14:23:25] we'll see I guess! [14:23:30] bblack: The definition can't be found if it's in the wrong file. [14:23:34] PROBLEM - puppet last run on cp3009 is CRITICAL puppet fail [14:23:35] PROBLEM - puppet last run on cp4017 is CRITICAL puppet fail [14:23:36] PROBLEM - puppet last run on cp4005 is CRITICAL puppet fail [14:23:49] 6operations, 10Analytics-Cluster, 6Analytics-Kanban, 5Patch-For-Review: Build new latest stable (0.8.2.1?) 
Kafka package and upgrade Kafka brokers - https://phabricator.wikimedia.org/T106581#1498395 (10Ottomata) [14:24:15] PROBLEM - puppet last run on etcd1001 is CRITICAL puppet fail [14:24:25] Just tested on cp2014 [14:24:29] ok [14:24:29] And works. [14:24:35] PROBLEM - puppet last run on cp2015 is CRITICAL puppet fail [14:24:36] RECOVERY - puppet last run on cp2014 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [14:24:45] RECOVERY - puppet last run on cp1065 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [14:25:07] (03PS1) 10BBlack: charon tuning [puppet] - 10https://gerrit.wikimedia.org/r/228260 [14:25:24] PROBLEM - puppet last run on cp2019 is CRITICAL puppet fail [14:25:35] PROBLEM - puppet last run on cp3013 is CRITICAL puppet fail [14:25:55] PROBLEM - puppet last run on cp3031 is CRITICAL puppet fail [14:26:04] PROBLEM - puppet last run on cp3046 is CRITICAL puppet fail [14:26:24] PROBLEM - puppet last run on cp2017 is CRITICAL puppet fail [14:26:54] Yeay mismatch between the fail and the test. [14:30:02] (03CR) 10BBlack: [C: 032] charon tuning [puppet] - 10https://gerrit.wikimedia.org/r/228260 (owner: 10BBlack) [14:31:14] 6operations, 6Labs, 3Labs-Sprint-107, 3ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1498398 (10Andrew) I think I've found a new issue with the .19 kernel, so investigating further today. 
[14:31:34] (03CR) 10Tim Landscheidt: Allow labstore1003 to rsync from stat servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/228251 (https://phabricator.wikimedia.org/T107576) (owner: 10Ottomata) [14:32:36] (03PS1) 10coren: labstore: add tests for working backups [puppet] - 10https://gerrit.wikimedia.org/r/228261 [14:33:11] 6operations, 7HTTPS: download.wikipedia.org is using an invalid certificate - https://phabricator.wikimedia.org/T107575#1498400 (10Chmarkine) [14:33:20] ACKNOWLEDGEMENT - nova-network process on labnet1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/nova-network andrew bogott This box is a WIP, these services are intentionally off for the time being. [14:33:38] 6operations, 10RESTBase-Cassandra, 6Services, 5Patch-For-Review, 7RESTBase-architecture: put new restbase servers in service - https://phabricator.wikimedia.org/T102015#1498401 (10GWicke) 1008 has now joined the cluster as well: ``` Datacenter: eqiad ================= Status=Up/Down |/ State=Normal/Leav... [14:33:52] (03PS1) 10coren: nrpe: trivial typo fixes to check_systemd_unit_lastrun [puppet] - 10https://gerrit.wikimedia.org/r/228262 [14:34:59] (03PS2) 10coren: nrpe: trivial typo fixes to check_systemd_unit_lastrun [puppet] - 10https://gerrit.wikimedia.org/r/228262 [14:35:46] (03CR) 10coren: "Trivial typos." [puppet] - 10https://gerrit.wikimedia.org/r/228262 (owner: 10coren) [14:35:53] (03CR) 10coren: [C: 032] "Trivial typos." [puppet] - 10https://gerrit.wikimedia.org/r/228262 (owner: 10coren) [14:36:47] 6operations, 10RESTBase: Update JDK 8 package in backports repo - https://phabricator.wikimedia.org/T104887#1498410 (10GWicke) The bootstrap has now finished, but GC still looks significantly less efficient: {F291830} [14:37:08] (03PS2) 10coren: labstore: add tests for working backups [puppet] - 10https://gerrit.wikimedia.org/r/228261 [14:37:24] YuviPanda: ^^ this adds the tests with very low limits to check that it triggers okay. 
[14:37:40] I'll also force a backup to fail to test that one. [14:37:46] (Once the test is in place) [14:38:25] !log bumped the kernel version on labvirt1005, rebooting. [14:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:39:15] (03CR) 10Yuvipanda: [C: 031] labstore: add tests for working backups [puppet] - 10https://gerrit.wikimedia.org/r/228261 (owner: 10coren) [14:39:50] (03CR) 10coren: [C: 032] "Alarms expected." [puppet] - 10https://gerrit.wikimedia.org/r/228261 (owner: 10coren) [14:41:28] PROBLEM - Host labvirt1005 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:19] RECOVERY - Host labvirt1005 is UP: PING OK - Packet loss = 0%, RTA = 1.44 ms [14:44:46] !log Update cxserver to 9669e19 [14:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:44:59] RECOVERY - puppet last run on cp2023 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:45:19] RECOVERY - puppet last run on cp2026 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:45:28] RECOVERY - puppet last run on cp3040 is OK Puppet is currently enabled, last run 21 seconds ago with 0 failures [14:45:28] RECOVERY - puppet last run on cp3010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:45:59] RECOVERY - puppet last run on cp3047 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:46:40] RECOVERY - puppet last run on cp3018 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:46:41] RECOVERY - puppet last run on cp2003 is OK Puppet is currently enabled, last run 53 seconds ago with 0 failures [14:47:00] RECOVERY - puppet last run on cp2016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:47:11] RECOVERY - puppet last run on cp3042 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:47:11] RECOVERY - puppet last run on cp4014 is OK Puppet is
currently enabled, last run 1 minute ago with 0 failures [14:47:21] RECOVERY - puppet last run on cp1058 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [14:47:51] RECOVERY - puppet last run on cp3030 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures [14:48:01] RECOVERY - puppet last run on cp1099 is OK Puppet is currently enabled, last run 19 seconds ago with 0 failures [14:48:05] (03PS1) 10Matthias Mullie: Get rid of $wgFlowOccupyPages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228271 (https://phabricator.wikimedia.org/T105574) [14:48:21] RECOVERY - puppet last run on cp2010 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:48:21] RECOVERY - puppet last run on cp2011 is OK Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:48:21] RECOVERY - puppet last run on cp4019 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:48:33] RECOVERY - puppet last run on cp1071 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:48:40] RECOVERY - puppet last run on cp4005 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures [14:48:50] RECOVERY - puppet last run on cp1073 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:49:11] RECOVERY - puppet last run on cp4015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:49:20] RECOVERY - puppet last run on cp1066 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures [14:49:41] RECOVERY - puppet last run on cp4017 is OK Puppet is currently enabled, last run 44 seconds ago with 0 failures [14:49:41] RECOVERY - puppet last run on cp3006 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:50:20] RECOVERY - puppet last run on cp3031 is OK Puppet is currently enabled, last run 6 seconds ago with 0 failures [14:50:21] RECOVERY - puppet last run on cp2019 is OK Puppet 
is currently enabled, last run 14 seconds ago with 0 failures [14:50:41] RECOVERY - puppet last run on cp3009 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:50:41] RECOVERY - puppet last run on cp1062 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:50:41] RECOVERY - puppet last run on cp2022 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:51:21] RECOVERY - puppet last run on cp3013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:51:31] RECOVERY - puppet last run on cp2015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:52:21] RECOVERY - puppet last run on etcd1001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:53:30] 6operations, 10Wikimedia-DNS: Set up compat redirect stats.wikipedia.org -> stats.wikimedia.org - https://phabricator.wikimedia.org/T21353#1498447 (10Chmarkine) [14:54:25] !log turned on alerting of backup status on labstore* with (by design) low limits. Expect alarms, and ignore. [14:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:54:45] RECOVERY - puppet last run on cp3046 is OK Puppet is currently enabled, last run 3 minutes ago with 0 failures [14:55:14] RECOVERY - puppet last run on cp2017 is OK Puppet is currently enabled, last run 4 minutes ago with 0 failures [15:01:06] Coren or paravoid, can I have a moment of your time to double-check something? (possibly more than a moment) [15:01:12] First ssh into labvirt1005-networktest101.eqiad.wmflabs [15:01:30] andrewbogott: Sure. [15:02:29] once you’re logged in, ping a couple of things. I’ve been using google.com and tools.wmflabs.org [15:02:47] (this step is just to confirm that you can indeed communicate with the outside world.) 
[15:03:27] !log bouncing restbase1005 (attempting to reproduce GC trends) [15:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:06:21] Coren: with me so far? [15:11:14] !added iojs/iojs-dbg 1.8.4 to jessie-wikimedia on carbon [15:11:42] moritzm: nice! [15:11:50] Is it hard to get iojs for trusty too? [15:13:54] 7Blocked-on-Operations, 6operations, 10Parsoid, 6Services: Offer io.js on Jessie - https://phabricator.wikimedia.org/T91855#1498510 (10MoritzMuehlenhoff) >>! In T91855#1484033, @MoritzMuehlenhoff wrote: > We can surely add them for now so that you can experiment/test with it, I can do that this week. I ha... [15:14:52] YuviPanda: no, but the ticket was named "Offer io.js on Jessie", so I didn't consider that necessary :-) [15:15:31] moritzm: can I ask for it on trusty too? :) so I can offer it to toollabs users [15:17:24] (03PS1) 10Jcrespo: Revert "returning db1035 to 100% load" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228279 [15:17:43] andrewbogott: Yes, I'm with you. (Sorry, got distracted) [15:17:45] (03CR) 10Jcrespo: [C: 032] Revert "returning db1035 to 100% load" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/228279 (owner: 10Jcrespo) [15:18:03] Coren: ok, now I’m going to suspend and resume that instance. You’ll notice that your session freezes and then unfreezes... [15:18:46] ok, unfrozen? [15:18:50] Yep [15:19:06] So, the fact that you have an active shell means that that instance is networked, for some value of ‘networked' [15:19:14] But, try your pings again [15:19:24] (03PS1) 10Giuseppe Lavagetto: Add wdqs1002 [dns] - 10https://gerrit.wikimedia.org/r/228280 [15:19:55] !log jynus Synchronized wmf-config/db-eqiad.php: reverting db1035 load to 10% (duration: 00m 14s) [15:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:20:34] andrewbogott: Hm. Interesting.
[15:20:37] Coren: if you’re like me, each ping gets exactly one packet and then no more packets again ever [15:20:54] (03CR) 10Giuseppe Lavagetto: [C: 032] Add wdqs1002 [dns] - 10https://gerrit.wikimedia.org/r/228280 (owner: 10Giuseppe Lavagetto) [15:21:20] andrewbogott: That's clearly a bug in the kernel clock. It looks like time got thrown by the suspend/resume cycle so that the sleep() between pings never returns. [15:21:36] andrewbogott: Because ping never *transmits* another packet [15:21:40] oh! The clock on the VM, you mean, or on the hosting box? [15:21:48] on the vm [15:21:56] huh [15:22:19] I would’ve thought that the resume process sent some kind of explicit ‘ok, now resync your clock’ message [15:22:31] you think if we force an ntp refresh it will shape up? [15:23:32] YuviPanda: it's currently meant for experiments by the Services team, before offering it to a wider audience, I think it needs a longterm plan (like the upcoming LTS branch) [15:24:17] andrewbogott: I don't know if it's the rtc that got confused. I don't think sleep() uses that. But I can tell you that 'sleep 2' on the command line also hangs so it's consistent. [15:24:42] ok [15:24:50] andrewbogott: Looks like the suspend/resume broke monotonicity somehow and the kernel got confused. [15:25:10] I feel like back on the 3.13 kernel this didn’t happen (although often the virt host crashed so soon it was hard to tell) [15:25:24] I mean, 3.13 on the virt host, not on the VM [15:25:24] Yep. Kernel for sure: the sleep hangs in the nanosleep() system call. [15:25:28] But maybe I just never tested it enough [15:26:02] moritzm: fair enough [15:27:54] Coren: I’m waiting for you to say something like “Changing the virt host kernel version could definitely not affect this issue" [15:37:03] (03PS1) 10Andrew Bogott: Disable instance suspend on Horizon. [puppet] - 10https://gerrit.wikimedia.org/r/228288 [15:37:49] (03PS2) 10Andrew Bogott: Disable instance suspend on Horizon. 
[puppet] - 10https://gerrit.wikimedia.org/r/228288 [15:38:38] (03CR) 10Andrew Bogott: [C: 032] Disable instance suspend on Horizon. [puppet] - 10https://gerrit.wikimedia.org/r/228288 (owner: 10Andrew Bogott) [15:42:44] (03PS1) 10coren: labstore: make the replication systemd units oneshot [puppet] - 10https://gerrit.wikimedia.org/r/228289 [15:42:49] 6operations, 6Commons: Commons thumbnail of Pluto photo is broken at 500px - https://phabricator.wikimedia.org/T105793#1498598 (10Aklapper) 5Open>3declined a:3Aklapper >>! In T105793#1466263, @MZMcBride wrote: > Do we happen to collect/log instances of how often we encounter this specific error ("Image w... [15:42:50] bblack: ^^ [15:43:44] Coren: I don't think we need Restart=no either, I suspect that was a hack around lack of Oneshot [15:44:28] (03PS2) 10coren: labstore: make the replication systemd units oneshot [puppet] - 10https://gerrit.wikimedia.org/r/228289 [15:44:28] w/out then [15:44:53] (03CR) 10BBlack: [C: 031] labstore: make the replication systemd units oneshot [puppet] - 10https://gerrit.wikimedia.org/r/228289 (owner: 10coren) [15:45:37] (03CR) 10coren: [C: 032] "To see if that fixes some oddities with status reporting" [puppet] - 10https://gerrit.wikimedia.org/r/228289 (owner: 10coren) [15:45:46] * Coren tries that [15:45:48] !log rebooting labvirt1005, again (3.16 this time) [15:45:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:46:18] !log Rebuilt kibana-int index to have 1 shard/2 replicas in logstash cluster [15:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:47:10] 6operations, 10Beta-Cluster, 5Patch-For-Review: Unify ::production / ::beta roles for *oid - https://phabricator.wikimedia.org/T86633#1498645 (10Aklapper) @yuvipanda what's left on this task? also is it blocked on anything? [15:48:15] bblack: Nope. 
root@labstore1002:~# [15:48:15] root@labstore1002:~# [15:48:23] Err, https://tools.wmflabs.org/paste/view/981a307c [15:49:25] PROBLEM - Host labvirt1005 is DOWN: PING CRITICAL - Packet loss = 100% [15:49:41] bblack: Honestly, that's beginning to confuse me a little. [15:49:42] Coren: if you define a timer for it, does it start showing the active stamp? this all seems very inconsistent [15:50:12] (also I have no idea if you've done systemctl daemon-reload either, or puppet did?) [15:50:38] bblack: I'd have expected puppet to have done it, but I'll double check. [15:51:05] (and I'd still upgrade packages too, there could be a relevant systemd bugfix in the updates) [15:52:05] RECOVERY - Host labvirt1005 is UP: PING OK - Packet loss = 0%, RTA = 1.19 ms [15:52:14] !log Rebuilt grafana-dashboards index to have 1 shard/2 replicas in logstash cluster [15:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:52:56] those indexes having 5 shards has been bugging me forever :) [15:53:20] 6operations, 10hardware-requests, 7Database: new external storage cluster(s) - https://phabricator.wikimedia.org/T105843#1498676 (10RobH) This has two vendors (dell/hp) being tracked for quoting: hp: https://rt.wikimedia.org/Ticket/Display.html?id=9507 dell: https://rt.wikimedia.org/Ticket/Display.html?id=... [15:53:21] bblack: There were updates, but they don't change this. Next step is trying to define a timer too. [15:55:55] well probably not relevant to this problem, but IMHO just "apt-get -y upgrade" everything unless there's a reason [15:56:03] (03PS1) 10Jgreen: a bit of reorg for clarity [software/otrs] - 10https://gerrit.wikimedia.org/r/228292 [15:56:07] it's out of date on a ton of things [15:58:14] PROBLEM - Host labvirt1005 is DOWN: PING CRITICAL - Packet loss = 100% [16:00:35] RECOVERY - Host labvirt1005 is UP: PING OK - Packet loss = 0%, RTA = 1.47 ms [16:01:11] bblack: Aha! The presence of a timer fixes it!
[16:02:03] odd [16:02:52] bblack: And it's not a one-time deal, removing the timer again removes the extra information. Kinda odd, I suppose, and means that I can't guard against the timer not existing beyond "Hey, there is no last run information" which - in itself - I suppose is an error state for this purpose [16:03:16] All right, reworking the check completely then. [16:09:42] (03PS1) 10Cmjohnson: Adding dns for analytics1046-49/ adding wmf**** for an1028+ [dns] - 10https://gerrit.wikimedia.org/r/228297 [16:09:50] (03CR) 10jenkins-bot: [V: 04-1] Adding dns for analytics1046-49/ adding wmf**** for an1028+ [dns] - 10https://gerrit.wikimedia.org/r/228297 (owner: 10Cmjohnson) [16:11:21] (03PS2) 10Cmjohnson: Adding dns for analytics1046-49/ adding wmf**** for an1028+ [dns] - 10https://gerrit.wikimedia.org/r/228297 [16:13:07] (03PS3) 10Cmjohnson: Adding dns for analytics1046-49/ adding wmf**** for an1028+ [dns] - 10https://gerrit.wikimedia.org/r/228297 [16:14:17] (03CR) 10Cmjohnson: [C: 032] Adding dns for analytics1046-49/ adding wmf**** for an1028+ [dns] - 10https://gerrit.wikimedia.org/r/228297 (owner: 10Cmjohnson) [16:14:20] legoktm: There is a follow up for the extjson change for the math extension https://gerrit.wikimedia.org/r/#/c/228298/1 [16:15:02] 6operations, 6Services, 10hardware-requests: Assign wmf4541,wmf4543 for service cluster expansion as scb1001, scb1002 - https://phabricator.wikimedia.org/T107287#1498733 (10mobrovac) >>! In T107287#1494272, @mark wrote: > We can also buy (or rent) servers, if that would be better. :) For this temporary tran...
[16:16:19] legoktm it would be good if that change landed in 1.26wmf17 [16:20:36] (03PS1) 10BryanDavis: Send $LOGUSER with dologmsg messages [puppet] - 10https://gerrit.wikimedia.org/r/228299 [16:33:33] greg-g: while i understand it's friday, i'd like to deploy https://gerrit.wikimedia.org/r/#/c/228300/ today [16:33:45] it fixes breakage of xml format in wikibase api modules [16:33:51] which some people do use [16:34:07] it's essentially back to how we had it before [16:34:43] aude: gotcha, ok [16:34:59] ok, thanks [16:49:05] * aude waits for jenkins [16:54:40] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, and 3 others: Wikidata Query Service hardware - https://phabricator.wikimedia.org/T86561#1498911 (10Joe) 5Open>3Resolved [16:56:54] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service, and 3 others: Wikidata Query Service hardware - https://phabricator.wikimedia.org/T86561#1498925 (10Joe) Both servers are up and running and the wdqs-blazegraph service is running and the banner page shows on port 80. I will finish this work (... [17:01:27] ping gilles [17:01:36] were you intending to deploy https://gerrit.wikimedia.org/r/#/c/228218/ ? [17:03:11] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service: Assign an LVS service to the wikidata query service - https://phabricator.wikimedia.org/T107601#1498950 (10Joe) 3NEW a:3Joe [17:06:25] !log aude Synchronized php-1.26wmf16/extensions/Wikidata: Fix api xml format (duration: 00m 20s) [17:06:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:07:14] 6operations, 6Discovery, 10Traffic, 10Wikidata, 10Wikidata-Query-Service: Set up a public interface to the wikidata query service - https://phabricator.wikimedia.org/T107602#1498962 (10Joe) 3NEW [17:08:35] greg-g: i like that gerrit generates submodule update patches for core [17:08:37] aude: I guess not? 
he also has this listed for Monday's swat: https://gerrit.wikimedia.org/r/#/c/228219/ [17:08:46] but not sure it's a good idea to automatically merge them [17:08:52] oh, that's just the extension/submodule bump [17:08:54] since people don't realize that happens [17:08:54] right [17:08:59] greg-g: it is [17:09:09] I just realized what you're saying via a different path of discovery :) [17:09:28] i will just leave it on tin and note in SAL [17:09:30] twentyafterfour: ^ is that intentional? [17:09:53] where "that" == "auto generate submodule patches and merging them for core" [17:10:34] !log wmf/1.26wmf16 core submodule bump for Ic25edf7 (MultimediaViewer) is now on tin [17:10:39] there [17:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:11:16] !log follow on to previous to be explicit: it's not deployed, it is queued for Monday morning SWAT [17:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:11:28] :) [17:11:34] :) [17:11:38] PROBLEM - Host labvirt1005 is DOWN: PING CRITICAL - Packet loss = 100% [17:11:46] eh? ^ [17:11:58] YuviPanda: ^ known? [17:13:38] jynus: you here? [17:13:52] Steinsplitter, yes [17:13:59] RECOVERY - Host labvirt1005 is UP: PING OK - Packet loss = 0%, RTA = 1.21 ms [17:14:26] jynus: where are all toollabs dbs located. i try to find it out since one hour... :................ [17:14:43] where as in which host? [17:14:49] mysql --defaults-file=~/replica.my.cnf -h s1.labsdb -e "show databases;" [17:14:56] no [17:15:00] where then? [17:15:03] it is on a separate host [17:15:12] may i ask which one?
[17:15:16] wait [17:15:34] I know ip, etc, but I have to look up the file you use [17:16:08] there was a commonsdelinquent_p before, but i can't find it on s1 nor commonswiki (.labsdb) [17:17:31] greg-g yes andrew is working on it [17:17:41] tools.db is the host for the tools db [17:17:42] it's a test host [17:17:46] YuviPanda: kk [17:17:47] something weird is happening when i type wikitext (things like ~~~~) because of the input tools thing from ULS :/ [17:17:53] or :: to indent [17:17:57] ^Steinsplitter [17:18:07] but I do not know if you refer to that [17:18:13] greg-g he's just using it to test kernel upgrades [17:19:04] jynus: i just like to know where commonsdelinquent_p is located now. it was before on commonswiki.labsdb [17:20:22] Steinsplitter, I cannot find that db on any of the hosts [17:20:34] gone ,O [17:20:36] ok, thanks [17:21:09] Steinsplitter, did you create that? [17:21:20] no [17:21:41] then probably the user deleted it [17:21:58] PROBLEM - Host labvirt1005 is DOWN: PING CRITICAL - Packet loss = 100% [17:22:33] YuviPanda: andrewbogott should we shut up icinga? :) [17:24:15] andrewbogott: ^ (if you aren't done with it) [17:24:42] I’m still rebooting 1005 frantically. I’ll mute icinga [17:25:04] Every boot is always supposed to be the last one but never is [17:25:21] :) [17:25:29] RECOVERY - Host labvirt1005 is UP: PING OK - Packet loss = 0%, RTA = 1.83 ms [17:28:50] 6operations, 10RESTBase: Update JDK 8 package in backports repo - https://phabricator.wikimedia.org/T104887#1499010 (10fgiunchedi) indeed, some nodes are still showing high GC, `restbase1005` had cassandra restarted to attempt to replicate the behaviour but I think it'll take some hours anyway. p50 latencies h...
[17:32:54] (03PS2) 10Yuvipanda: Labs: Subscribe self-hosted puppetmaster to hiera.yaml changes [puppet] - 10https://gerrit.wikimedia.org/r/227622 (https://phabricator.wikimedia.org/T107205) (owner: 10Tim Landscheidt) [17:35:51] kaldari: whats up [17:37:22] (03CR) 10Yuvipanda: [C: 032] Labs: Subscribe self-hosted puppetmaster to hiera.yaml changes [puppet] - 10https://gerrit.wikimedia.org/r/227622 (https://phabricator.wikimedia.org/T107205) (owner: 10Tim Landscheidt) [17:40:32] jzerebecki: you here? [17:41:25] !log revert to openjdk8 and restart cassandra on restbase1001 T104887 [17:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:53:28] jynus, so will ops definitely be buying new hardware as a stop-gap? [17:53:32] !log revert to openjdk8 and restart cassandra on restbase1002 [17:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:54:02] Checking to make sure Flow isn't blocking phase 1 of solving the issue. For later, we still want to do https://phabricator.wikimedia.org/T106363 and need confirmation that ops will be willing to set up a Flow-specific External Store cluster. [17:55:09] matt_flaschen, are you talking about es, labs? [17:55:37] jynus, I'm talking about External Store in production. What labs issue are you referring to? [17:55:58] PROBLEM - puppet last run on cp3031 is CRITICAL puppet fail [17:56:33] do not worry, was talking about an unrelated topic on another public channel [17:57:03] matt_flaschen, I think my manager already have some hardware quotes [17:57:34] the idea is to fix the issue on hardware first to keep the machine running without depending on software [17:57:55] I think tim proposed do some refactoring at the same time [17:58:35] jynus, okay, that's what I wanted to confirm. 
In parallel, though, please sync up with us on whether ops is okay with our desired medium-term solution of making a separate Flow cluster: https://phabricator.wikimedia.org/T106363#1487965 [17:58:39] but the clock is ticking, so it will be done with or without it to keep things running [17:59:07] jynus, where did Tim mention this refactoring, ops list? [17:59:25] it is one of the 11111 phabricator tickets [17:59:35] related to es/compression [17:59:50] matt_flaschen, I do not know much about flow [17:59:53] Coren: ok, kernel 3.16.0-45-generic doesn’t have the clock-breakage problem. [18:00:02] Now I just have to figure out if it can survive frequent suspend/resumes... [18:00:04] but x1 has capacity [18:00:11] so I am surprised about that [18:00:23] jynus, surprised about what? [18:00:59] about specifically needing extra hardware [18:01:12] specially now that es will have much more capacity [18:01:16] 6operations, 10Analytics, 6Discovery, 10MediaWiki-General-or-Unknown, and 5 others: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#1499169 (10GWicke) [18:01:24] (03CR) 10coren: [C: 031] "LGTM, with a safe-to-ignore nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/228179 (https://phabricator.wikimedia.org/T106590) (owner: 10Yuvipanda) [18:01:52] jynus, we don't need a replacement for x1. We want a separate External Store. It doesn't matter whether it's physically separate hardware, as long as it's logically separate so trackBlobs.php and recompressTracked.php will not affect the Flow External Store. [18:01:54] please involve Sean on that, I may not be too aware of the issue [18:01:58] andrewbogott: Ah, so that's an actual bug then. That's better - I thought the behaviour was the suck [18:02:24] I don’t like that this feature is broken in two different ways in two out of three kernels I’ve tried. [18:02:26] why a different physical es? 
[18:02:34] If it’s broken in an old one and a new one and only works in the middle one… that doesn’t bode well [18:02:41] cannot you write on a separate db? [18:02:54] andrewbogott: Hm. It /is/ disturbing. [18:03:07] we will probably have 12TB in total of space [18:03:17] https://www.irccloud.com/pastebin/TEdnYE3R/ [18:03:28] maybe I am not understanding it, matt_flaschen [18:04:12] jynus, it doesn't have to be different physically, as long as it's logically a separate External Store setup. [18:04:13] ok [18:04:13] Coren: does ^ look like anything to you? [18:04:13] now we are talking [18:04:13] jynus, the issue is that trackBlobs.php and recompressTracked.php are not suitable for Flow, and rather than making Flow fit into them we would prefer to have a separate logical ES cluster, since Flow has different characteristics. [18:04:25] then it is a pure DBA help what you need [18:04:32] ok, we can help with that [18:04:59] andrewbogott: It's trying to mount a network block device, and there is nothing to mount there. I'm surprised it's even trying, but I don't expect that not finding a filesystem on a device that's not used is an issue. [18:05:15] ok, we’ll see if the vms work [18:05:50] my only issue, which I wrote is that I am not the right person to comment on arch changes of code I do not fully understand [18:05:54] however, if you need help on "moving this rows away" I will be glad to help [18:06:30] matt_flaschen, does that help? [18:08:01] jynus, we also need confirmation that you can set up a separate logical cluster (not important to me what physical hardware it's on). [18:08:01] jynus, I will CC springle at https://phabricator.wikimedia.org/T106363 . If you can comment there, it would also be helpful. 
[18:08:01] ok, for that I need numbers [18:08:02] as in "rows, size and usage" [18:08:13] or how to get them "this correspond to table xyz" [18:08:24] !log revert to openjdk8 and restart cassandra on restbase1003 [18:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:08:33] !log made User:Flow talk page manager a 'bot' on all wikis (except loginwiki) [18:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:08:43] can you setup a separate task or setup a meeting? matt_flaschen [18:08:51] jynus, okay, I think we can get you the rows/size at least. Not sure how to get usage, and not sure exactly how the load balancer and logical/physical mapping work. [18:08:59] I can do that [18:09:06] but I need of what :-) [18:09:24] sorry, as an op, I do not have the full picture [18:09:40] but if you tell me where to search, I definitely can help [18:09:48] (03CR) 10Dzahn: "to save energy, and if we reuse it for something else it usually changes hostname" [puppet] - 10https://gerrit.wikimedia.org/r/227997 (owner: 10Dzahn) [18:09:59] matt_flaschen, what is your timezone? [18:10:36] jynus, Eastern (Philadelphia/New York), but I tend to work a later schedule (starting 12/1ish). [18:10:51] ok, so lets do this [18:11:09] lets setup a meeting next week [18:11:16] I also don't have the full picture, but I think we can work it out. [18:12:04] so I can fully understand what is the data subset (prepare pointers to code/table structures) [18:12:32] jynus, what timezone are you in? [18:12:36] and then I will bring it to Sean/other ops to discuss what is the optimal way to solve it [18:12:44] I am CEST [18:12:51] Madrid/Berlin [18:13:18] if you prefer it [18:13:27] Sean is in Australia [18:13:40] whatever works better with you [18:14:23] What city is he in?
[18:14:52] I am ashamed to say I do not remember [18:15:15] 6operations, 6Labs, 3Labs-Sprint-107, 3ToolLabs-Goals-Q4: Investigate kernel issues on labvirt** hosts - https://phabricator.wikimedia.org/T99738#1499192 (10Andrew) Update: 3.19 kernels don't crash when I suspend/resume, but the VMs don't come up properly; their clocks are seriously broken such that a sim... [18:15:16] (03PS4) 10Yuvipanda: labstore: Fixup start-nfs for new storage layout [puppet] - 10https://gerrit.wikimedia.org/r/228179 (https://phabricator.wikimedia.org/T106590) [18:15:17] coren you were right! updated [18:15:21] (03CR) 10jenkins-bot: [V: 04-1] labstore: Fixup start-nfs for new storage layout [puppet] - 10https://gerrit.wikimedia.org/r/228179 (https://phabricator.wikimedia.org/T106590) (owner: 10Yuvipanda) [18:15:21] it's not really reasonable times for all three of us, but it's doable if we really wanted to. [18:15:36] (03PS5) 10Yuvipanda: labstore: Fixup start-nfs for new storage layout [puppet] - 10https://gerrit.wikimedia.org/r/228179 (https://phabricator.wikimedia.org/T106590) [18:15:39] !log multatuli - installing package upgrades [18:15:45] I can stay late one day, not an issue [18:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:15:53] but not today [18:16:13] (03CR) 10Dzahn: "as long as it's in puppet it will also be in monitoring and we can't shut it down, so it will just sit there and also needs package upgrad" [puppet] - 10https://gerrit.wikimedia.org/r/227997 (owner: 10Dzahn) [18:16:49] jynus, is 4 PM Tuesday your time okay? [18:16:52] 6operations, 6Discovery, 10Wikidata, 10Wikidata-Query-Service: Assign an LVS service to the wikidata query service - https://phabricator.wikimedia.org/T107601#1499196 (10Smalyshev) > Is the querying stateful in any way? No. > Do we expect to need to route different urls to different servers? No. How...
[18:19:17] that should do
[18:21:09] jynus, you're also the same time zone as Matthias, who is the main person working on this. So I'll definitely invite him.
[18:21:37] please do
[18:21:44] greg: define intentional
[18:22:25] so I will ask you, so you have it prepared: what data, where is it, and what you want to do logically
[18:22:45] if .gitmodules has a branch reference then gerrit does the automerge. The presence of that branch reference in .gitmodules is dependent on which version of git you have when you run make-wmf-branch
[18:22:56] so not really entirely intentional but I think it's desirable
[18:22:58] RECOVERY - puppet last run on cp3031 is OK Puppet is currently enabled, last run 2 minutes ago with 0 failures
[18:23:02] jynus, yep, I'll prep. We both invited each other, so I'll delete mine.
[18:23:03] I will figure out where is the best place to do it from a DBA/ops point of view (machines, replication, etc.)
[18:23:37] (CR) coren: [C: 1] labstore: Fixup start-nfs for new storage layout [puppet] - https://gerrit.wikimedia.org/r/228179 (https://phabricator.wikimedia.org/T106590) (owner: Yuvipanda)
[18:24:36] matt_flaschen, matthias what, you can pm me
[18:25:04] (CR) Yuvipanda: [C: 2] labstore: Fixup start-nfs for new storage layout [puppet] - https://gerrit.wikimedia.org/r/228179 (https://phabricator.wikimedia.org/T106590) (owner: Yuvipanda)
[18:26:04] Yeah, Matthias Mullie, he is usually mlitn on IRC.
[18:26:07] ok, bye for now
[18:26:24] Coren, Yuvi, the upshot is: 3.16 works fine, and I’m also furious.
[18:26:25] don't be naughty
[18:26:52] andrewbogott: why furious
[18:27:10] Because 3.13 and 3.19 are both broken for this one feature, and in different ways.
[18:27:50] Which makes me think that a) kernel devs or ubuntu dev have no regression-testing regimen and b) we are running the only actual production openstack cloud in the world
[18:28:10] Or, I guess, c) these HP servers are just super broken
[18:28:17] andrewbogott: Either way, it means we have to be extra careful with kernel switches.
[18:29:02] I know!
[18:29:15] #bringtheciscosback? :P
[18:29:27] I’m having one of those “I can’t believe the world works at all” days
[18:29:41] Actually, I bet I can run these same tests on a cisco to see if it’s a hardware interaction.
[18:30:04] cmjohnson1: virt100x are still racked, right? Mind if I run some tests on virt1001?
[18:30:19] andrewbogott: they're gone
[18:30:24] ok, nevermind then :)
[18:31:00] operations, Collaboration-Team, Collaboration-Team-Sprint-F-Finishing-Move-2015-08-04, Flow: Setup separate External Store for Flow - https://phabricator.wikimedia.org/T107610#1499219 (Mattflaschen) NEW a:Mattflaschen
[18:31:30] PROBLEM - puppet last run on labstore1002 is CRITICAL puppet fail
[18:31:35] looking
[18:31:45] (PS1) Cmjohnson: Adding dhcp details for analtyics1046-49 [puppet] - https://gerrit.wikimedia.org/r/228311
[18:32:50] (CR) Cmjohnson: [C: 2] Adding dhcp details for analtyics1046-49 [puppet] - https://gerrit.wikimedia.org/r/228311 (owner: Cmjohnson)
[18:32:56] (PS1) Yuvipanda: labstore: Remove autorestarting nfs-exports daemon [puppet] - https://gerrit.wikimedia.org/r/228314
[18:33:02] (CR) jenkins-bot: [V: -1] labstore: Remove autorestarting nfs-exports daemon [puppet] - https://gerrit.wikimedia.org/r/228314 (owner: Yuvipanda)
[18:33:56] (PS2) Yuvipanda: labstore: Remove autorestarting nfs-exports daemon [puppet] - https://gerrit.wikimedia.org/r/228314
[18:35:13] (CR) Yuvipanda: [C: 2] labstore: Remove autorestarting nfs-exports daemon [puppet] - https://gerrit.wikimedia.org/r/228314 (owner: Yuvipanda)
[18:36:17] !log revert to openjdk8 and restart cassandra on restbase1004
[18:36:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:37:17] operations: syslog-ng and rsyslog jousting on lithium - https://phabricator.wikimedia.org/T107611#1499239 (ori)
[18:37:39] RECOVERY - puppet last run on labstore1002 is OK Puppet is currently enabled, last run 43 seconds ago with 0 failures
[18:39:19] operations, Collaboration-Team, Collaboration-Team-Sprint-F-Finishing-Move-2015-08-04, Flow: Setup separate External Store for Flow - https://phabricator.wikimedia.org/T107610#1499254 (jcrespo) "@jcrespo indicated this should be fine, as long as it doesn't require new hardware, which it does not"...
[18:43:21] !log restarted phd on iridium. I had to forcefully kill one stuck repository worker to get the daemons to restart properly.
[18:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:44:06] !log oddly, the symptom was that there were logs about apc cache entries that had been on the GC queue for too long, I guess this is due to phd being stuck
[18:44:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:44:59] (CR) Dzahn: "confirmed. you are right. the rule for port 514 is there and i see traffic with netstat" [puppet] - https://gerrit.wikimedia.org/r/227697 (owner: Muehlenhoff)
[18:45:22] (I'm not sure why overdue garbage collection would warrant a 503 error but it's plausible)
[18:46:13] (PS3) Dzahn: Enable base::firewall on lithium [puppet] - https://gerrit.wikimedia.org/r/227697 (owner: Muehlenhoff)
[18:46:16] HTTP/503 Garbage bin full
[18:47:48] PROBLEM - Disk space on labvirt1007 is CRITICAL: DISK CRITICAL - free space: /var/lib/nova/instances 91437 MB (3% inode=99%)
[18:48:28] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0]
[18:49:01] andrewbogott: ^^
[18:49:30] YuviPanda: dang, ok.
[18:50:20] operations, ops-eqiad: Decom and wipe cisco virt servers virt1001-1009 then remove from racks - https://phabricator.wikimedia.org/T107159#1499270 (RobH) a:Cmjohnson
[18:51:24] (CR) Dzahn: [C: 2] Enable base::firewall on lithium [puppet] - https://gerrit.wikimedia.org/r/227697 (owner: Muehlenhoff)
[18:52:28] operations, Labs-Sprint-104, Labs-Sprint-105, Labs-Sprint-107: Setup/Install/Deploy labnet1002 - https://phabricator.wikimedia.org/T99701#1499274 (RobH) a:RobH>None I'm not sure what the status is on this, other than the task was assigned to me. Is the new NIC not working?
[18:55:33] moritzm: may I delete any of these instances? https://dpaste.de/Q8CG
[18:55:43] (‘no’ is a fine answer, I’m just freeing up some space)
[18:59:58] operations, ops-eqiad: Decom and wipe cisco virt servers virt1001-1009 then remove from racks - https://phabricator.wikimedia.org/T107159#1499292 (Cmjohnson) The wiping process in progress
[19:00:19] ottomata: 4 new AN boxes are ready for you to install
[19:00:35] analytics1046-49
[19:02:10] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[19:02:41] !log revert to openjdk8 and restart cassandra on restbase1005
[19:02:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:02:52] YES!!!!
[19:03:02] cmjohnson1: that is exactly what I want to do right now, thanks!
[19:03:05] operations, ops-eqiad, Analytics-Cluster, Patch-For-Review: rack new hadoop worker nodes - https://phabricator.wikimedia.org/T104463#1499295 (Cmjohnson) 1046-1049 are racked and setup in row B and ready for installs DNS/DHCP/Raid Cfg/Switch Cfg has been completed
[19:03:06] perfect timing :)
[19:03:21] ottomata: lmk if you experience problems with an1049
[19:03:24] cmjohnson1: so I just need to PXE boot, yet?
[19:03:28] yeah
[19:03:28] ori: is the 'hhvm-test’ instance still being actively used?
[19:03:34] everything else is done
[19:03:37] oh, btw, cmjohnson1, hyperthreading is on for these?
[19:03:41] should have asked before the other 4
[19:03:51] HT enabled is default now
[19:03:56] great.
[19:03:57] ori: I’m hungry for disk space on its host, it’s using 150g.
[19:08:08] andrewbogott: puppet-jmm-salt-client02 can go away (I was under the impression I had it removed in wikitech last week), but I need the others as test systems for debdeploy or the firewall stuff
[19:08:19] moritzm: ok, thanks!
[19:09:22] (PS1) coren: nrpe: Merge check_systemd_unit_lastrun into _state [puppet] - https://gerrit.wikimedia.org/r/228329
[19:09:47] (PS1) Dzahn: syslog-ng: remove srange for now [puppet] - https://gerrit.wikimedia.org/r/228331
[19:10:21] (PS2) Dzahn: syslog-ng: remove srange for now [puppet] - https://gerrit.wikimedia.org/r/228331
[19:12:37] operations, Collaboration-Team, Collaboration-Team-Sprint-F-Finishing-Move-2015-08-04, Flow: Setup separate External Store for Flow - https://phabricator.wikimedia.org/T107610#1499315 (Mattflaschen)
[19:12:43] (CR) Muehlenhoff: [C: 2 V: 2] syslog-ng: remove srange for now [puppet] - https://gerrit.wikimedia.org/r/228331 (owner: Dzahn)
[19:13:20] operations, Collaboration-Team, Collaboration-Team-Sprint-F-Finishing-Move-2015-08-04, Flow: Setup separate External Store for Flow - https://phabricator.wikimedia.org/T107610#1499219 (Mattflaschen) >>! In T107610#1499254, @jcrespo wrote: > "@jcrespo indicated this should be fine, as long as it doe...
[19:18:51] operations, Collaboration-Team, Collaboration-Team-Sprint-F-Finishing-Move-2015-08-04, Flow: Setup separate logical External Store for Flow - https://phabricator.wikimedia.org/T107610#1499339 (Mattflaschen)
[19:21:05] !log revert to openjdk8 and restart cassandra on restbase1006
[19:21:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:23:22] PROBLEM - Disk space on labvirt1007 is CRITICAL: DISK CRITICAL - free space: /var/lib/nova/instances 75597 MB (3% inode=99%)
[19:28:01] PROBLEM - Host mobile-lb.eqiad.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[19:28:09] cmjohnson1: racktables not updated, ja? these are in B3?
[19:28:10] PROBLEM - Restbase endpoints health on cerium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:28:10] PROBLEM - Restbase endpoints health on restbase1007 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:28:31] PROBLEM - Restbase endpoints health on restbase1006 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[19:28:51] PROBLEM - salt-minion processes on lvs1004 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[19:29:11] RECOVERY - Host mobile-lb.eqiad.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 2.10 ms
[19:29:13] ottomata: yes in B3, ..I plan to do racktables later....i wanted to get them install ready for you
[19:29:27] np, just need to note it in the hadoop network topology
[19:29:33] what's going on?
[19:29:36] ok
[19:29:43] got paged too for mobile-lb
[19:29:54] (PS1) Ottomata: Provision analytics1046-1049 as Hadoop worker nodes [puppet] - https://gerrit.wikimedia.org/r/228343 (https://phabricator.wikimedia.org/T104463)
[19:29:57] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&c=API+application+servers+eqiad&m=cpu_report&s=by+name&mc=2&g=network_report ?
[19:30:04] same here, now have the page saying it's up
[19:30:20] (PS1) Muehlenhoff: Various cleanups [debs/debdeploy] - https://gerrit.wikimedia.org/r/228344
[19:30:22] http://grafana.wikimedia.org/#/dashboard/db/restbase?panelId=12&fullscreen
[19:30:22] (PS2) Ottomata: Provision analytics1046-1049 as Hadoop worker nodes [puppet] - https://gerrit.wikimedia.org/r/228343 (https://phabricator.wikimedia.org/T104463)
[19:30:30] RECOVERY - Restbase endpoints health on cerium is OK: All endpoints are healthy
[19:30:31] RECOVERY - Restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[19:30:32] PHP API latency seems to have skyrocketed
[19:30:51] RECOVERY - Restbase endpoints health on restbase1006 is OK: All endpoints are healthy
[19:30:57] I also have some mysterious pages that read "alrt #17, 318, #19... on ops-gmtplus" not very informative
[19:31:26] 19:21 < godog> !log revert to openjdk8 and restart cassandra on restbase1006 ?
[19:31:33] surely not this broad impact though?
[19:31:40] (CR) Ottomata: [C: 2] Provision analytics1046-1049 as Hadoop worker nodes [puppet] - https://gerrit.wikimedia.org/r/228343 (https://phabricator.wikimedia.org/T104463) (owner: Ottomata)
[19:31:57] bblack: I hope not
[19:31:58] bblack: no, that's very unlikely to affect anything but RB itself
[19:32:20] PROBLEM - salt-minion processes on lvs1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[19:32:21] huge graph impacts on text caches eqiad, but not esams
[19:32:26] also appservers, api appservers, etc
[19:32:34] also, why is salt-minion alerting at the same time, on LVS?
[19:33:05] RB traffic was down as well
[19:33:10] LVS issues?
[19:33:20] Jul 31 19:24:32 lvs1001 kernel: [15990017.911851] warn_alloc_failed: 40 callbacks suppressed
[19:33:23] Jul 31 19:24:33 lvs1001 kernel: [15990017.911858] SLUB: Unable to allocate memory on node -1 (gfp=0x20)
[19:33:26] Jul 31 19:24:33 lvs1001 kernel: [15990017.911870] IPVS: ip_vs_conn_new(): no memory
[19:33:31] I'm seeing that on lvs1004 too
[19:33:33] Jul 31 19:24:33 lvs1001 kernel: [15990017.911881] SLUB: Unable to allocate memory on node -1 (gfp=0x20)
[19:33:35] salt got killed for mem
[19:33:36] Jul 31 19:24:33 lvs1001 kernel: [15990017.911882] IPVS: ip_vs_conn_new(): no memory
[19:34:33] Jul 31 19:24:15 lvs1004 ntpd[7980]: i/o error on routing socket No buffer space available - disabling
[19:35:05] ^ seems to be the first indicator on 1004
[19:36:44] * YuviPanda is here but not of much use
[19:39:19] * YuviPanda got pages, is standing by
[19:39:46] haha, uhhh, cmjohnson1
[19:39:57] root@analtyics1046:~# hostname
[19:39:57] analtyics1046
[19:40:02] not sure how it got that...
[19:40:08] oh, dns probably...
[19:40:14] ah yes ha
[19:40:18] hm, maybe I can just reboot it then?
[19:40:22] after fixing dns?
[19:40:55] (PS1) Ottomata: Fix typo: analtyics1046 -> analytics1046 [dns] - https://gerrit.wikimedia.org/r/228349
[19:41:17] (CR) Ottomata: [C: 2 V: 2] Fix typo: analtyics1046 -> analytics1046 [dns] - https://gerrit.wikimedia.org/r/228349 (owner: Ottomata)
[19:41:30] (I manually created partitions for this node, wasn't sure what was wrong at the time)
[19:42:31] PROBLEM - Disk space on labvirt1007 is CRITICAL: DISK CRITICAL - free space: /var/lib/nova/instances 78149 MB (3% inode=99%)
[19:44:26] (CR) Muehlenhoff: [C: 2 V: 2] Various cleanups [debs/debdeploy] - https://gerrit.wikimedia.org/r/228344 (owner: Muehlenhoff)
[19:45:42] RECOVERY - salt-minion processes on lvs1004 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[19:47:45] ottomata: hah..yeah thx for fixing
[19:47:53] YuviPanda: Merged, fixed and python'ed: https://gerrit.wikimedia.org/r/#/c/228329/
[19:48:27] hm, cmjohnson1, i think i need to clear the PTR for it somehow
[19:48:38] after reboot it still got the bad name, even though I did authdns-update
[19:49:48] cmjohnson1: according to
[19:49:49] https://wikitech.wikimedia.org/wiki/DNS#Remove_a_record_from_the_DNS_resolver_caches
[19:49:49] dns update has to propagate ...did you do a host analytics1046 first to make sure it's updated?
[19:50:23] I did a dig of each nameserver
[19:50:24] its there
[19:50:27] but, it is a different record
[19:50:30] so it won't invalidate the old one
[19:50:33] coren nice! I can check for style issues but I saw a big discussion between you, jo.e and b.black about how to do this
[19:50:40] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL 1.69% of data above the critical threshold [1000.0]
[19:50:42] i guess?
[19:50:51] RECOVERY - salt-minion processes on lvs1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[19:50:53] yeah so do rec_control wipe-cache
[19:50:54] i'm not 100% on how the DHCP stuff assigns the hostname
[19:50:58] i forgot about that
[19:51:03] cmjohnson1: not found on ns0
[19:51:26] !log ori Synchronized php-1.26wmf16/includes/EditPage.php: More debug logging for T102199 (duration: 00m 12s)
[19:51:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:53:26] ottomata: bib
[19:54:15] cmjohnson1: ?
[19:54:37] YuviPanda: Yeah, that's the result of that discussion; using 'systemctl show' vs 'journalctl'
[19:54:56] !log revert to openjdk8 and restart cassandra on restbase1007
[19:55:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:55:03] ottomata: i have to run out ..back shortly
[19:55:07] coren ah I see. no way to get that in a json or some other format?
[19:55:31] YuviPanda: Nope. journalctl did json, but systemctl doesn't. (Yeah, silly).
[19:55:33] !log ori Synchronized php-1.26wmf16/includes/User.php: More debug logging for T102199 (duration: 00m 13s)
[19:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:55:41] heh
[19:56:01] YuviPanda: It's also less useful because it requires a timer set up; but this way doesn't require elevated privileges.
[19:56:15] for just the active ones or?
[19:56:31] i mean, does it change the 'unit is running' check to need it to be a timer?
[19:56:57] YuviPanda: No, the "must be a timer" is only a requirement for checking the last run data on an inactive unit
[19:57:34] YuviPanda: And the check is constructed so that it doesn't even try that unless you're expecting 'periodic' *and* the unit is inactive
[19:58:00] right
[20:00:26] robh: yt? how can I clear a dns entry? rec_control doesn't exist on ns* hosts
[20:02:03] ottomata: if you mean ns0, ns1, ns2 .wikimedia.org, those are not caches, they're authservers
[20:02:31] rec_control is for the pdns caches that resolv.conf uses
[20:02:34] ah, hm.
[20:03:06] what's the issue?
[20:03:16] well it can wait I guess, I want to look at LVS more too
[20:03:37] yeah, it can wait. i'm installing a new node and the DHCP is picking up a previously committed typo for the PTR
[20:03:43] so it's getting a bad hostname on boot
[20:03:50] but ja, no hurry
[20:06:05] bblack: you can see the bad hostname by doing dig -x 10.64.21.105
[20:06:53] ottomata: use @server to dig that at various servers. you'll probably find ns[012] are fine. check caches.
[20:07:04] the caches are where you need to clear it with rec_control
[20:07:09] e.g. hydrogen, etc
[20:07:36] hm, can't log into caches?
[20:08:16] they're normal hosts...
[20:08:28] operations, Labs-Sprint-104, Labs-Sprint-105, Labs-Sprint-107: Setup/Install/Deploy labnet1002 - https://phabricator.wikimedia.org/T99701#1499455 (Andrew) a:Andrew The box is up and running. Getting it to actually do useful Nova things is up to me now.
[20:08:55] Blocked-on-Operations, operations, Parsoid, Services: Offer io.js on Jessie - https://phabricator.wikimedia.org/T91855#1499457 (GWicke) > I have imported 1.8.4 for jessie-wikimedia (iojs and iojs-dbg) Ohh, thank you! Small nit: The current stable version is 2.5.0: https://deb.nodesource.com/iojs_...
[20:09:11] hm, i think it's my ssh proxy settings, hang on
[20:09:31] ja got it
[20:10:08] looking better, rebooting...
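The PTR troubleshooting above (query the reverse name with `dig -x`, compare answers per server, then clear only the caches that still hold the stale name) can be sketched roughly as follows. This is a minimal illustration and not anything run in the channel: the `stale_servers` helper, the `example` domain and the sample answers are all hypothetical, and the code only compares answers that were already collected rather than querying any server.

```python
import ipaddress
from typing import Dict, List


def ptr_name(ip: str) -> str:
    """Return the reverse-lookup name that a PTR query (as with dig -x) asks for."""
    return ipaddress.ip_address(ip).reverse_pointer


def stale_servers(answers: Dict[str, str], expected: str) -> List[str]:
    """List the servers whose PTR answer differs from the expected hostname.

    `answers` maps a server label to the PTR value it returned; collecting
    those values (for example with dig @server) is left to the operator.
    """
    want = expected.rstrip(".").lower()
    return sorted(
        server
        for server, answer in answers.items()
        if answer.rstrip(".").lower() != want
    )


# Hypothetical sample data: the authoritative server already has the fix,
# while one recursor cache still serves the typo'd name.
sample = {
    "ns0 (authoritative)": "analytics1046.example.",
    "hydrogen (cache)": "analtyics1046.example.",
}
print(ptr_name("10.64.21.105"))
print(stale_servers(sample, "analytics1046.example"))
```

Only the servers this returns would need their cache entry cleared; the authoritative servers keep no cache, which matches the point made above about ns0, ns1 and ns2.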
[20:11:24] !log revert to openjdk8 and restart cassandra on restbase1008
[20:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:13:11] gwicke: oh, I thought you wanted the 1.x branch, since it was originally mentioned in the ticket, I'll import 2.5 on Monday, then
[20:14:57] !log ori Synchronized php-1.26wmf16/includes/objectcache/ObjectCacheSessionHandler.php: Uncommitted revert of I4afaecd to test impact on T102199 (duration: 00m 12s)
[20:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:17:14] (PS1) BBlack: ipvs: some rudimentary defenses via sysctl [puppet] - https://gerrit.wikimedia.org/r/228401
[20:18:00] (PS3) Mattflaschen: Enable Flow on all wikis, except private and a couple special wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/226954 (https://phabricator.wikimedia.org/T106562)
[20:18:14] matt_flaschen: are you planning to deploy that today?
[20:18:26] ori, no, it still needs product approval.
[20:18:33] Please don't
[20:18:44] good, thanks
[20:18:49] Partly because it includes Commons which exploded when I tried to enable Flow there yesterday
[20:18:58] i have a number of uncommitted live hacks in prod to get to the bottom of T102199
[20:19:06] lego says that should be fixed now, but I'm not willing to test that on a Friday
[20:19:22] (CR) Mattflaschen: [C: -2] "Do not deploy without coordination with Collaboration team." [mediawiki-config] - https://gerrit.wikimedia.org/r/226954 (https://phabricator.wikimedia.org/T106562) (owner: Mattflaschen)
[20:19:38] Hmm I still have a live hack in Flow left over from yesterday's debugging
[20:19:39] RoanKattouw, I'm not trying to deploy it today, just finishing up the patch.
[20:19:47] Good :)
[20:20:17] I wasn't sure whether Ori was about to be accommodating or about to ask you not to, so I figured I'd get there first ;)
[20:21:13] (CR) Mattflaschen: "(This needs both technical and product approval)." [mediawiki-config] - https://gerrit.wikimedia.org/r/226954 (https://phabricator.wikimedia.org/T106562) (owner: Mattflaschen)
[20:22:42] RoanKattouw, the Commons explosion was due to Flow talk page manager?
[20:22:51] Yeah
[20:23:27] matt_flaschen: https://phabricator.wikimedia.org/T107301#1499186
[20:26:02] (PS4) Mattflaschen: Enable Flow on all wikis, except private and a couple special wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/226954 (https://phabricator.wikimedia.org/T106562)
[20:26:56] PROBLEM - DPKG on labvirt1005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[20:27:01] (CR) Mattflaschen: "This is blocked on I3f01fa40fcb364382caddad268d1d90a4d37ad9a being on all WMF wikis." [mediawiki-config] - https://gerrit.wikimedia.org/r/226954 (https://phabricator.wikimedia.org/T106562) (owner: Mattflaschen)
[20:27:17] RECOVERY - Disk space on labvirt1007 is OK: DISK OK
[20:27:47] PROBLEM - DPKG on analtyics1046 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[20:28:33] moritzm: thanks a bunch!
[20:28:42] and have a nice weekend!
[20:29:48] RECOVERY - DPKG on analtyics1046 is OK: All packages OK
[20:36:53] aude: I scheduled it for monday's swat
[20:37:20] (PS1) Ottomata: Fix typo in hadoop net-topology json [puppet] - https://gerrit.wikimedia.org/r/228403
[20:37:25] (CR) jenkins-bot: [V: -1] Fix typo in hadoop net-topology json [puppet] - https://gerrit.wikimedia.org/r/228403 (owner: Ottomata)
[20:37:33] (PS2) Ottomata: Fix typo in hadoop net-topology json [puppet] - https://gerrit.wikimedia.org/r/228403
[20:38:12] ah, I see that discussion happened right after you pinged me
[20:38:32] (PS3) Ottomata: Fix typo in hadoop net-topology json [puppet] - https://gerrit.wikimedia.org/r/228403
[20:38:52] (CR) Ottomata: [C: 2 V: 2] Fix typo in hadoop net-topology json [puppet] - https://gerrit.wikimedia.org/r/228403 (owner: Ottomata)
[20:43:19] gilles: ok
[20:43:27] just so you know, it's on tin already
[20:43:32] but not deployed
[20:44:37] RECOVERY - carbon-cache too many creates on graphite1001 is OK Less than 1.00% above the threshold [500.0]
[20:47:32] (CR) Dzahn: [C: 1] "deploying Monday" [puppet] - https://gerrit.wikimedia.org/r/225041 (https://phabricator.wikimedia.org/T105981) (owner: Glaisher)
[20:49:53] (PS2) BBlack: ipvs: some rudimentary defenses via sysctl [puppet] - https://gerrit.wikimedia.org/r/228401
[20:49:55] (CR) Dzahn: "what about 8140?" [puppet] - https://gerrit.wikimedia.org/r/226501 (owner: Muehlenhoff)
[20:51:07] (PS1) EBernhardson: Prevent caching of search requests partitipating in AB test [puppet] - https://gerrit.wikimedia.org/r/228404 (https://phabricator.wikimedia.org/T106888)
[20:53:01] (PS2) EBernhardson: Prevent caching of search requests partitipating in AB test [puppet] - https://gerrit.wikimedia.org/r/228404 (https://phabricator.wikimedia.org/T106888)
[20:53:07] PROBLEM - Disk space on labvirt1007 is CRITICAL: DISK CRITICAL - free space: /var/lib/nova/instances 71838 MB (3% inode=99%)
[20:59:05] (CR) BBlack: [C: 2] ipvs: some rudimentary defenses via sysctl [puppet] - https://gerrit.wikimedia.org/r/228401 (owner: BBlack)
[21:01:15] (CR) Dzahn: [C: 1] "7 <% if @server_type == 'backend' or @server_type == 'frontend' -%>" [puppet] - https://gerrit.wikimedia.org/r/226501 (owner: Muehlenhoff)
[21:02:46] operations, Analytics-Cluster, Analytics-Kanban, Patch-For-Review: Build new latest stable (0.8.2.1?) Kafka package and upgrade Kafka brokers - https://phabricator.wikimedia.org/T106581#1499614 (Ottomata) analytics1046-1049 are online as of today! I have started the decommission process of analytic...
[21:05:03] (PS2) Muehlenhoff: Add ferm rules for puppet master backends [puppet] - https://gerrit.wikimedia.org/r/226501
[21:09:10] (CR) Dzahn: "noop until base::firewall" [puppet] - https://gerrit.wikimedia.org/r/226501 (owner: Muehlenhoff)
[21:09:20] (PS3) Dzahn: Add ferm rules for puppet master backends [puppet] - https://gerrit.wikimedia.org/r/226501 (owner: Muehlenhoff)
[21:10:43] operations, Discovery, Traffic, Wikidata, Wikidata-Query-Service: Set up a public interface to the wikidata query service - https://phabricator.wikimedia.org/T107602#1499654 (Smalyshev) > what hostname would we use? query.wikidata.org Yes, looks like it from discussion with wikidata team. >...
[21:13:00] (CR) Catrope: [C: 1] Convert wmgLiquidThreadsBackfill to wmgLiquidThreadsFrozen [mediawiki-config] - https://gerrit.wikimedia.org/r/228192 (https://phabricator.wikimedia.org/T107068) (owner: Mattflaschen)
[21:13:05] YuviPanda: can you plz add me to mobile-smoketests project.. kindof emergency
[21:15:20] rmoen: done
[21:15:28] andrewbogott: thanks
[21:15:49] andrewbogott: thanks
[21:15:59] andrewbogott: thanks
[21:38:03] (PS1) JanZerebecki: Add query.wikidata.org [dns] - https://gerrit.wikimedia.org/r/228411 (https://phabricator.wikimedia.org/T107602)
[21:38:11] (CR) jenkins-bot: [V: -1] Add query.wikidata.org [dns] - https://gerrit.wikimedia.org/r/228411 (https://phabricator.wikimedia.org/T107602) (owner: JanZerebecki)
[21:43:16] PROBLEM - torrus.wikimedia.org HTTP on netmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - string Torrus Top: Wikimedia not found on http://torrus.wikimedia.org:80/torrus - 289 bytes in 0.301 second response time
[21:43:28] PROBLEM - Disk space on labvirt1008 is CRITICAL: DISK CRITICAL - free space: /var/lib/nova/instances 90424 MB (3% inode=99%)
[21:53:04] operations, Discovery, Traffic, Wikidata, and 2 others: Set up a public interface to the wikidata query service - https://phabricator.wikimedia.org/T107602#1499866 (Legoktm) >>! In T107602#1499826, @gerritbot wrote: > Add query.wikidata.org CentralAuth cookies are currently set for ".wikidata.org"...
[22:05:46] operations, Wikimedia-General-or-Unknown, network: Implement RPKI (Resource Public Key Infrastructure) - https://phabricator.wikimedia.org/T61115#1499900 (Tgr)
[22:08:48] operations, Traffic, Wikimedia-DNS, Patch-For-Review: DNS request for wikimedia.org - https://phabricator.wikimedia.org/T107060#1499914 (CCogdill_WMF) Now that we've selected a new domain that is more specific to fundraising (benefactorevents.wikimedia.org), can we add the DNS records as requested?...
[22:09:17] operations, Wikimedia-Site-requests: Run "refreshLinks.php --dfn-only" on all wikis periodically - https://phabricator.wikimedia.org/T18112#1499918 (PleaseStand)
[22:35:16] operations, network: investigate ethernet errors: asw2-a5-eqiad port xe-0/0/36 - https://phabricator.wikimedia.org/T107635#1499988 (BBlack) NEW
[22:35:41] operations, network: investigate ethernet errors: asw2-a5-eqiad port xe-0/0/36 - https://phabricator.wikimedia.org/T107635#1499997 (BBlack)
[22:37:32] operations, Discovery, Traffic, Wikidata, and 2 others: Set up a public interface to the wikidata query service - https://phabricator.wikimedia.org/T107602#1499998 (Smalyshev) The service does not need to access them but I'm not sure how we can avoid them being sent... Maybe have some varnish rule...
[22:45:43] jdlrobson: about?
[22:45:51] chasemp: hey yup
[22:45:56] (PS1) Mforns: Reenable reporting for mobile-reportcard [puppet] - https://gerrit.wikimedia.org/r/228420 (https://phabricator.wikimedia.org/T104379)
[22:47:01] https://phabricator.wikimedia.org/p/BarryTheBrowserTestBot/ is making tons of weird empty pastes and being spammy https://phabricator.wikimedia.org/paste/
[22:47:11] yeh already sorted
[22:47:30] ah ok I disabled it at some point
[22:47:33] there was a rogue script context https://phabricator.wikimedia.org/T107549
[22:48:04] i'm keeping an eye on it
[22:48:15] https://www.mediawiki.org/wiki/Phabricator/Bots
[22:48:34] could we use a phab bot account so it's easier to track down?
[22:48:54] alternatively could you put your details on that user profile or who owns it
[22:49:17] Noted. I will do that now.
[22:50:50] where is that running?
[22:50:57] I mostly ask as you are a member of https://phabricator.wikimedia.org/project/members/61/
[22:51:27] and the pastes show as created by you? ah it's just that one
[22:51:31] probably the last test for fix
[22:57:01] chasemp updated. process should be shut down now however so please do let me know if you see any more pastes
[23:03:12] gwicke: is the ‘htmldump’ instance in the ‘services’ project in active use? (Nothing personal, I’m asking everybody things like this today.)
[23:03:50] not very active, but it's also not dead
[23:03:58] you can stop it if you want to save memory
[23:04:17] we could then re-start it on demand
[23:10:27] RECOVERY - Disk space on labvirt1007 is OK: DISK OK
[23:11:40] great job, nova! You rescheduled an instance off of a server with a disk space alarm and made a disk space alarm go off on a different server!
[23:12:02] (PS1) Andrew Bogott: Try to get nova not to schedule instances on totally full servers. [puppet] - https://gerrit.wikimedia.org/r/228422
[23:13:08] (CR) Andrew Bogott: [C: 2] Try to get nova not to schedule instances on totally full servers. [puppet] - https://gerrit.wikimedia.org/r/228422 (owner: Andrew Bogott)
[23:21:29] (PS1) Andrew Bogott: remove_unused_base_images => True [puppet] - https://gerrit.wikimedia.org/r/228425
[23:25:01] chasemp: are the phab-xx labs instances still useful?
[23:25:18] andrewbogott: phab-01 is used and by the phab guys as an alpha
[23:25:29] the rest are various ppls some wmde and some idk like Negative24's?
[23:25:32] ok
[23:25:38] I don't have a bead on them all at this point
[23:25:46] that’s fine, I’ll leave them be.
[23:26:07] I just nuked a phab08 I think for cron spam as old w/ wmde
[23:26:16] so idk what they are actually using
[23:26:28] I’m stressed because I tried to rebalance a full virt node and it rebalanced onto a different node that is now so full that I can’t move things off of it :(
[23:26:42] Because for some reason step one of migration is ‘make a copy of the VM’s drive'
[23:26:49] bah suck
[23:27:17] And the new migration code is all “you can’t specify the target because the scheduler will automatically make a good decision"
[23:27:31] which in this case involved moving a 151G instance onto a server that had 151.5G of free space
[23:28:23] (PS1) GWicke: Set up a listing page for /api/ in all projects [puppet] - https://gerrit.wikimedia.org/r/228426 (https://phabricator.wikimedia.org/T107086)
[23:29:36] gwicke: unfortunately I’m short on disk space, not memory. So I need something that I can actually delete.
[23:30:13] andrewbogott: if you are stuck and phab-01 saves you we can rebuild it man
[23:30:30] chasemp: thanks. It’s not really big enough anyway I don’t think.
[23:30:48] there is chasetest project you could nuke all of if needed
[23:33:58] andrewbogott: in that case, you can nuke the dumps vm; we can always rebuild it
[23:34:28] gwicke: really? That will totally pull me out of this fire.
[23:34:29] Thank you!
[23:34:52] andrewbogott: yw!
[23:34:55] Just to double-check, we’re still talking about htmldump in ‘services’?
[23:35:00] yes
[23:35:04] awesome, thanks.
[23:35:18] I’m tweaking the scheduler so when/if you create it it will /probably/ put it in a more sensible place.
[23:37:56] RECOVERY - Disk space on labvirt1008 is OK: DISK OK
[23:38:37] gwicke: ^ is all thanks to you :)
[23:40:42] ACKNOWLEDGEMENT - DPKG on labvirt1005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages andrew bogott I will investigate, but this is a test system so not critical.
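The placement problem described at 23:27 (a 151G instance landing on a host with 151.5G free) comes down to a missing headroom check. The sketch below shows one way such a check could look. It is purely illustrative: `pick_host`, the `reserve_gb` default and the host names are hypothetical, and nothing here reflects nova's actual scheduler code or the contents of the puppet change mentioned above.

```python
from typing import Dict, Optional


def pick_host(
    free_gb: Dict[str, float],
    instance_gb: float,
    reserve_gb: float = 50.0,  # hypothetical safety margin, not a nova setting
) -> Optional[str]:
    """Pick the host with the most free disk that keeps a reserve after placement.

    Returns None when no host can take the instance without dropping below
    the reserve, so the caller can refuse the move instead of filling a disk.
    """
    candidates = {
        host: free
        for host, free in free_gb.items()
        if free - instance_gb >= reserve_gb
    }
    if not candidates:
        return None
    return max(candidates, key=lambda host: candidates[host])


# Hypothetical hosts; only the 151 / 151.5 figures come from the log.
hosts = {"host-a": 151.5, "host-b": 75.0}
print(pick_host(hosts, 151.0))                        # no host keeps the reserve
print(pick_host({**hosts, "host-c": 400.0}, 151.0))   # host-c has room to spare
```

Returning `None` rather than the least-bad host is a deliberate choice in this sketch: the failure described above was a move that technically fit but left the target too full to migrate anything off again.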
[23:41:17] andrewbogott: those were easily earned brownie points
[23:41:47] glad to hear it :)
[23:42:30] ;)
[23:47:09] (PS1) GWicke: Add an API listing template to the allowed templates in extract2.php [mediawiki-config] - https://gerrit.wikimedia.org/r/228429 (https://phabricator.wikimedia.org/T107086)
[23:47:37] (CR) GWicke: [C: -2] "-2 until that page is protected." [mediawiki-config] - https://gerrit.wikimedia.org/r/228429 (https://phabricator.wikimedia.org/T107086) (owner: GWicke)
[23:52:12] (CR) GWicke: "The target page is now protected, so this is ready to go." [mediawiki-config] - https://gerrit.wikimedia.org/r/228429 (https://phabricator.wikimedia.org/T107086) (owner: GWicke)
[23:55:41] (PS2) GWicke: Set up a listing page for /api/ in all projects [puppet] - https://gerrit.wikimedia.org/r/228426 (https://phabricator.wikimedia.org/T107086)