[00:00:13] We're running into the late SWAT, but I'd suggest to just hijack that
[00:00:18] * hoo will be swatting, then
[00:00:26] mutante, so apparently we would need to write a custom plugin to reasonably parse this JSON^
[00:00:43] hoo: i think swat is done an hour ago
[00:00:53] MaxSem: ok, let's see, i'm checking on neon
[00:00:53] oh, looks like it
[00:00:58] nothing much
[00:01:00] just takes a while
[00:01:04] in that case, let's just enable client on the wikis
[00:01:06] hello guys
[00:01:07] :)
[00:01:12] ok
[00:05:50] hoo: https://www.wikidata.org/wiki/Q3938 :)
[00:06:14] Nice :)
[00:06:25] aude: Should these wikis have languageLinkSiteGroup => wikipedia
[00:06:50] i think so
[00:07:28] (PS3) Dzahn: Removed mgmt DNS for virt20[0-1][1-9], pc200[1-3], labsdb200[1-3] and WMF5709 [dns] - https://gerrit.wikimedia.org/r/247480 (https://phabricator.wikimedia.org/T115372) (owner: Papaul)
[00:07:40] * aude recalls issues with the interwiki links on commons that required this setting
[00:10:00] yeah, it means to show wikipedia interwiki links on commons
[00:10:13] (PS1) Hoo man: Enable WikibaseClient on mediawikiwiki, metawiki and specieswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/247764 (https://phabricator.wikimedia.org/T115653)
[00:10:18] aude: ^
[00:10:19] (CR) jenkins-bot: [V: -1] Enable WikibaseClient on mediawikiwiki, metawiki and specieswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/247764 (https://phabricator.wikimedia.org/T115653) (owner: Hoo man)
[00:10:23] Please review carefully
[00:10:26] (PS4) Dzahn: Removed mgmt DNS for virt20[0-1][1-9], pc200[1-3], labsdb200[1-3] and WMF5709 [dns] - https://gerrit.wikimedia.org/r/247480 (https://phabricator.wikimedia.org/T115372) (owner: Papaul)
[00:10:30] and assume that's what we want also here
[00:10:32] * aude looks
[00:11:06] that's why
[00:11:07] :P
[00:11:36] (PS2) Hoo man: Enable WikibaseClient on mediawikiwiki, metawiki and specieswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/247764 (https://phabricator.wikimedia.org/T115653)
[00:11:47] there is a syntax error
[00:11:58] (CR) Dzahn: [C: 2] Removed mgmt DNS for virt20[0-1][1-9], pc200[1-3], labsdb200[1-3] and WMF5709 [dns] - https://gerrit.wikimedia.org/r/247480 (https://phabricator.wikimedia.org/T115372) (owner: Papaul)
[00:12:48] aude: Yeah, fixed in PS2
[00:12:50] (CR) Dzahn: "all Cisco and mgmt" [dns] - https://gerrit.wikimedia.org/r/247480 (https://phabricator.wikimedia.org/T115372) (owner: Papaul)
[00:13:05] PROBLEM - check google safe browsing for wikiquote.org on google is CRITICAL: HTTP CRITICAL: HTTP/1.1 302 Found - string Safe Browsing has not rece... not found on https://www.google.com:443/safebrowsing/diagnostic?site=wikiquote.org/ - 800 bytes in 0.079 second response time
[00:13:16] i think the patch is ok now
[00:13:18] heh
[00:14:05] we just have to remember when they get data access, to ensure they get arbitrary access to start with
[00:14:23] or commons will get it first and then it will just be the default
[00:14:58] (CR) Aude: [C: 1] Enable WikibaseClient on mediawikiwiki, metawiki and specieswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/247764 (https://phabricator.wikimedia.org/T115653) (owner: Hoo man)
[00:16:29] Ops-Access-Requests, operations, Patch-For-Review: adding tjones to analytics-privatedata-users (hive and webrequests) - https://phabricator.wikimedia.org/T115880#1740749 (Dzahn) @EBernhardson Thank you for checking that. Ok, this was also confirmed by Otto on the gerrit change. Everything looks good...
[00:16:34] hoo: my only concern is maybe wikispecies or such might not want wikipedia interwiki links
[00:16:54] aude: mh
[00:16:54] if they complain, we can change but otherwise think should be same as commons
[00:17:10] their main page has no interwiki links
[00:17:19] aude: They already have Wikipedia links
[00:17:23] but the idea is to give them that
[00:17:24] so don't think they'll mind
[00:17:26] ok
[00:17:28] see https://species.wikimedia.org/wiki/Chordata
[00:17:29] :)
[00:17:35] operations, Patch-For-Review: Google Safe Browsing Monitoring turned CRIT - https://phabricator.wikimedia.org/T116099#1740752 (Dzahn) still not really working after switch to https, won't find the string
[00:17:42] Ok, let's do that \o/
[00:17:49] then no concerns :)
[00:18:37] (CR) Hoo man: [C: 2] Enable WikibaseClient on mediawikiwiki, metawiki and specieswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/247764 (https://phabricator.wikimedia.org/T115653) (owner: Hoo man)
[00:18:43] (Merged) jenkins-bot: Enable WikibaseClient on mediawikiwiki, metawiki and specieswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/247764 (https://phabricator.wikimedia.org/T115653) (owner: Hoo man)
[00:19:46] !log hoo@tin Synchronized wmf-config/: Enable WikibaseClient on mediawikiwiki, metawiki and specieswiki (duration: 00m 19s)
[00:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:19:53] operations: Old salt grains not removed if a role changes - https://phabricator.wikimedia.org/T115983#1740763 (Dzahn) p:Triage>Normal
[00:19:59] \o/
[00:20:56] hoo: is everything correct?
[00:20:59] or done?
[00:21:23] operations, Traffic, HTTPS: status.wikimedia.org is using SSL cert from other domain - https://phabricator.wikimedia.org/T34796#1740773 (Dzahn) >>! In T34796#1737605, @hashar wrote: > Do we really care of having `status.wikimedia.org` to be served over TLS? yes
[00:21:23] doh
[00:21:27] https://meta.wikimedia.org/wiki/Special:Version
[00:21:30] forget to sync the dblists
[00:21:33] ok
[00:21:56] Was already wondering
[00:22:04] !log hoo@tin Synchronized dblists/: Enable WikibaseClient on mediawikiwiki, metawiki and specieswiki (duration: 00m 17s)
[00:22:06] yeah
[00:22:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:22:15] here we go: https://www.mediawiki.org/wiki/Quality_Assurance
[00:22:15] looks fine now
[00:22:16] https://www.mediawiki.org/wiki/Project:Sandbox
[00:22:23] should unlink that one, though
[00:23:06] and [[File:{{#property:P18|from=Q147}}|350px]] doesn't work on my meta user page :/
[00:23:14] as intended
[00:23:18] (PS1) Ori.livneh: ~ori: add `branchdir` script [puppet] - https://gerrit.wikimedia.org/r/247766
[00:23:20] aude: :D
[00:23:25] How's that a bad thing?
[00:23:36] operations, Traffic, HTTPS: status.wikimedia.org is using SSL cert from other domain - https://phabricator.wikimedia.org/T34796#1740782 (Dzahn) a:Dzahn>None
[00:23:40] * aude wants global user page kittenC!
[00:23:41] !
[00:23:46] but not today
[00:24:21] mw.wikibase.label( 'Q42' )
[00:24:21] Lua error in console input at line 7: attempt to index field 'wikibase' (a nil value).
[00:24:37] So that's fine as well
[00:24:41] great
[00:25:02] maybe now update the phabricator tickets
[00:25:04] ?
[00:25:13] Yeah, will close them :))
[00:25:40] maybe let lydia announce on project chat?
[00:25:55] Definitely
[00:26:01] k
[00:26:07] * aude sleeps now
[00:26:30] oh yes... sorry for keeping you awake :(
[00:26:38] nah, it's good :)
[00:26:47] * aude can take care of geodata tomorrow, if you like
[00:26:58] I'll probably be in meetings all day
[00:27:02] ok
[00:27:03] so that would be nice
[00:27:14] no nearby or anything yet
[00:27:21] woohoo
[00:27:32] time to wikidata everything :D
[00:27:38] :D
[00:27:52] yea, do the org chart in wikidata
[00:28:01] assign Q to all people :p
[00:28:05] aude: Shall we wait until tomorrow or shall I quickly put it up on the project chat?
[00:28:06] do "part of" for teams
[00:28:10] (Wait for Lydia)
[00:28:18] hoo: either way
[00:28:23] * aude doesn't think lydia would mind
[00:28:31] she can send a mail, tweet, etc
[00:28:34] Ok, will let them know then
[00:28:41] k
[00:28:47] thanks for your help with all this
[00:29:28] !log legoktm wikidata'd everything
[00:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[00:30:14] Don't abuse the bot\!
[00:30:42] Escaping quotation marks is hard.
[00:31:04] aude: ugh https://www.wikidata.org/wiki/Special:DispatchStats
[00:31:18] Can hackily fix that, though
[00:31:35] Or we can just wait
[00:31:39] but will take a while
[00:33:59] oh, dispatching is so fast that it catches up already
[00:34:00] nice
[00:38:20] operations, Patch-For-Review: Google Safe Browsing Monitoring turned CRIT - https://phabricator.wikimedia.org/T116099#1740806 (Dzahn) should this use the real API ? https://developers.google.com/safe-browsing/lookup_guide https://sb-ssl.google.com/safebrowsing/api/lookup?...
[00:59:04] (PS1) Dzahn: ganglia: add ssl cert expiry check [puppet] - https://gerrit.wikimedia.org/r/247771
[01:00:56] (PS2) Dzahn: ganglia: add ssl cert expiry check [puppet] - https://gerrit.wikimedia.org/r/247771
[01:00:59] I have someone asking to update the interwiki cache
[01:01:06] but I didn't do that in a while
[01:01:32] (PS2) Dzahn: protactinium: mark as role spare [puppet] - https://gerrit.wikimedia.org/r/247578
[01:01:36] (CR) Dzahn: [C: 2] ganglia: add ssl cert expiry check [puppet] - https://gerrit.wikimedia.org/r/247771 (owner: Dzahn)
[01:02:18] (PS3) Dzahn: protactinium: mark as role spare [puppet] - https://gerrit.wikimedia.org/r/247578
[01:02:48] (CR) Dzahn: [C: 2] protactinium: mark as role spare [puppet] - https://gerrit.wikimedia.org/r/247578 (owner: Dzahn)
[01:03:01] !log hoo@tin Synchronized wmf-config/interwiki.cdb: Updating interwiki cache (duration: 00m 17s)
[01:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:03:34] !log hoo@tin Synchronized wmf-config/interwiki.cdb: revert (duration: 00m 18s)
[01:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:03:40] meh
[01:04:38] $default_all_dblist = getRealmSpecificFilename( "/srv/mediawiki/all.dblist" );
[01:04:45] I don't think that works anymore
[01:06:12] is that from dumpInterwiki?
[01:06:15] yeah
[01:06:30] Luckily I saved the old version beforehand
[01:08:19] indeed
[01:08:44] (PS2) Dzahn: interface: do not 'ensure latest',do require_package [puppet] - https://gerrit.wikimedia.org/r/247008 (https://phabricator.wikimedia.org/T115348)
[01:09:04] (CR) Dzahn: "done. using require_package now" [puppet] - https://gerrit.wikimedia.org/r/247008 (https://phabricator.wikimedia.org/T115348) (owner: Dzahn)
[01:11:26] (PS2) Dzahn: puppet: do not 'ensure latest', do require_package [puppet] - https://gerrit.wikimedia.org/r/247007 (https://phabricator.wikimedia.org/T115348)
[01:12:12] !log hoo@tin Synchronized wmf-config/interwiki.cdb: Updating interwiki cache (duration: 00m 17s)
[01:13:04] Manually passed in the lists via parameters now
[01:15:16] We should fix that so that the next person to run updateinterwikicache doesn't break all the links
[01:15:42] yeah, patch on the way
[01:16:43] Krenair: ori: https://gerrit.wikimedia.org/r/247772
[01:18:06] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds
[01:18:50] hoo, there's a couple of others to fix as well
[01:18:56] flaggedrevs-periodic-update.sh
[01:18:58] makeSizeDBLists.php
[01:19:02] removeDeletedWikis.php
[01:19:10] maybe this one in SecurePoll: cli/wm-scripts/bv2015/doSpam.php
[01:19:34] (CR) Dzahn: maps: move hieradata from codfw to role/common (4 comments) [puppet] - https://gerrit.wikimedia.org/r/246992 (owner: Dzahn)
[01:19:38] (PS2) Dzahn: maps: move hieradata from codfw to role/common [puppet] - https://gerrit.wikimedia.org/r/246992
[01:19:48] Krenair: *sigh*, ok
[01:20:50] (PS3) Dzahn: maps: move hieradata from codfw to role/common [puppet] - https://gerrit.wikimedia.org/r/246992
[01:23:16] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 6 below the confidence bounds
[01:23:49] Krenair: Do you plan to handle these?
[01:23:58] I think I'll head off in a short bit
[01:24:03] I'll do them, sure
[01:24:10] Nice! :)
[01:24:16] The WikimediaMaintenance ones anyway
[01:24:29] Don't know if Jamesofur cares about the SecurePoll bv2015 stuff
[01:25:32] I like to keep them around tbh at least until the next election but mostly so that they can be copied and tweaked. They should never be used again as is
[01:26:00] sure, so should they be updated with the new working paths?
[01:26:15] I'm not planning to run them myself :p
[01:28:07] hoo, Reedy, ori: https://gerrit.wikimedia.org/r/247773
[01:28:27] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds
[01:28:27] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 6 below the confidence bounds
[01:28:46] ty
[01:31:14] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[01:47:21] Krenair: probably best if easy :)
[01:47:28] k
[01:48:16] Jamesofur, I take it you don't care about bv2013 anymore
[01:48:24] correct
[01:48:27] can be deleted tbh
[01:48:30] for all I care :)
[01:58:14] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 8 below the confidence bounds
[02:10:33] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected
[02:19:29] operations, ops-codfw, Patch-For-Review: power off Codfw-Cisco Servers - https://phabricator.wikimedia.org/T115372#1740935 (Dzahn) 12:19 < mutante> papaul: are all the cisco servers shut down? 12:20 < papaul> no 12:20 < papaul> there are stay up 12:20 < papaul> doing the wipe 12:20 < mutante> but you d...
[02:34:35] Krinkle: {{done}}
[02:37:49] !log l10nupdate@tin Synchronized php-1.27.0-wmf.2/cache/l10n: l10nupdate for 1.27.0-wmf.2 (duration: 08m 05s)
[02:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:42:44] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.2) at 2015-10-21 02:42:44+00:00
[02:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:09:07] !log l10nupdate@tin Synchronized php-1.27.0-wmf.3/cache/l10n: l10nupdate for 1.27.0-wmf.3 (duration: 08m 24s)
[03:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:13:43] !log l10nupdate@tin LocalisationUpdate completed (1.27.0-wmf.3) at 2015-10-21 03:13:42+00:00
[03:13:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:40:24] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds
[03:42:04] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[03:49:26] ...
[03:49:28] I didn't say that
[04:22:03] PROBLEM - puppet last run on mw2172 is CRITICAL: CRITICAL: puppet fail
[04:51:32] RECOVERY - puppet last run on mw2172 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:52:46] (CR) Dzahn: "@Alex, it seemed nice to me. we would like to override contact groups to give icinga permissions per service. we recently started with wdq" [puppet] - https://gerrit.wikimedia.org/r/244814 (owner: John F. Lewis)
[04:58:37] (CR) Dzahn: [C: 1] "i didn't test anything here, but i sure like the direction this goes, deleting the manual dsh group files" [puppet] - https://gerrit.wikimedia.org/r/247324 (owner: Chad)
[05:02:56] operations, Patch-For-Review: Google Safe Browsing Monitoring turned CRIT - https://phabricator.wikimedia.org/T116099#1741072 (Dzahn) "The API key format has changed. API keys are now managed in the Google Developers Console," who has that console and a key ?:)
[05:03:48] operations: Google Safe Browsing Monitoring turned CRIT - https://phabricator.wikimedia.org/T116099#1741073 (Dzahn)
[05:04:58] operations: Google Safe Browsing Monitoring turned CRIT - https://phabricator.wikimedia.org/T116099#1741075 (Dzahn) p:Triage>Normal i'll still say priority normal since this is broken monitoring (due to Google changing things on their side), not actual alarms that our sites have a problem
[05:06:26] operations, Icinga, Monitoring: Google Safe Browsing Monitoring turned CRIT - https://phabricator.wikimedia.org/T116099#1741077 (Dzahn)
[05:06:59] operations, Icinga, Monitoring: Google Safe Browsing Monitoring turned CRIT (rewrite check using the real API) - https://phabricator.wikimedia.org/T116099#1741078 (Dzahn)
[05:09:20] operations, Icinga, Monitoring: Google Safe Browsing Monitoring turned CRIT (rewrite check using the real API) - https://phabricator.wikimedia.org/T116099#1741079 (Dzahn) or anyone, can you fix https://gerrit.wikimedia.org/r/#/c/247760/2/modules/icinga/manifests/gsbmonitoring.pp with different options o...
[05:13:24] PROBLEM - puppet last run on mw2002 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:30:09] operations, Database, Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1741100 (Dzahn) >>! In T111654#1737846, @jcrespo wrote: > * Recommended cipher and key length (I suppose 2048), that we use for other production services (I assume `ssl_cipher=TLSv1.2`,...
[05:37:28] operations, Database, Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1741108 (Dzahn) >>! In T111654#1737846, @jcrespo wrote: > * (I assume `ssl_cipher=TLSv1.2 actually, a value TLSv1.2 i would expect for ssl_protocols vs. ssl_cipher. and then a list of...
[05:39:32] RECOVERY - puppet last run on mw2002 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[05:44:03] operations, Database, Patch-For-Review: Set up TLS for MariaDB replication - https://phabricator.wikimedia.org/T111654#1741116 (Dzahn) https://mariadb.com/kb/en/mariadb/mysql_get_ssl_cipher/ https://mariadb.com/kb/en/mariadb/mysql_ssl_set/ https://mariadb.com/kb/en/mariadb/ssl-status-variables/#ssl_ci...
[05:45:31] operations, MediaWiki-extensions-CentralNotice, Database, Schema-change: Create CentralNotice campaign mixin tables - https://phabricator.wikimedia.org/T110963#1741122 (awight) Open>Resolved a:awight Thank you!
[05:49:03] PROBLEM - puppet last run on mw2123 is CRITICAL: CRITICAL: puppet fail
[05:59:30] operations: Trigger some sort of alert if the memcache-serious log file is filling up at a greater than usual rate - https://phabricator.wikimedia.org/T95231#1741143 (Dzahn) this sounds like something for check_graphite, right? YuvOri?
[06:04:23] operations, Performance-Team, Traffic, Performance: Optimize prod's resource domains for SPDY/HTTP2 - https://phabricator.wikimedia.org/T94896#1741150 (Dzahn) Open>Resolved a:Dzahn >>! In T94896#1567576, @BBlack wrote: > Well this basically got solved along the way while doing other things....
[06:06:05] operations, Traffic: missing SPDY coalesce for upload.wm.o for images ref'd in projects' page outputs - https://phabricator.wikimedia.org/T116132#1741154 (Dzahn) NEW
[06:07:36] operations, Traffic: missing SPDY coalesce for upload.wm.o for images ref'd in projects' page outputs - https://phabricator.wikimedia.org/T116132#1741166 (Dzahn)
[06:08:52] operations, Performance-Team, Traffic, Performance: Optimize prod's resource domains for SPDY/HTTP2 - https://phabricator.wikimedia.org/T94896#1741173 (Dzahn) >>! In T94896#1567576, @BBlack wrote: > new one about eventually looking at the specific upload problem. T116132
[06:10:12] operations, Traffic: missing SPDY coalesce for upload.wm.o for images ref'd in projects' page outputs - https://phabricator.wikimedia.org/T116132#1741175 (Dzahn) p:Triage>Low
[06:12:02] PROBLEM - puppet last run on cp3005 is CRITICAL: CRITICAL: puppet fail
[06:16:52] RECOVERY - puppet last run on mw2123 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:22:27] operations, Monitoring: give icinga a "login" link - https://phabricator.wikimedia.org/T82499#1741213 (Dzahn) Open>declined a:Dzahn
[06:22:41] operations, Icinga, Monitoring: give icinga a "login" link - https://phabricator.wikimedia.org/T82499#1741217 (Dzahn)
[06:26:06] operations, Icinga, Monitoring: re-create script for manual paging - https://phabricator.wikimedia.org/T82937#1741232 (Dzahn)
[06:26:07] operations, Icinga, Monitoring: re-create script for manual paging - https://phabricator.wikimedia.org/T82937#1741235 (Dzahn)
[06:30:02] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: puppet fail
[06:30:04] PROBLEM - puppet last run on mw1086 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:07] operations, Icinga, Mail: fix/streamline mail routing off of neon - https://phabricator.wikimedia.org/T80890#1741239 (Dzahn) a:faidon @Faidon this must be long time ago, right. can you confirm this should be closed?
[06:30:08] operations, Icinga, Mail: fix/streamline mail routing off of neon - https://phabricator.wikimedia.org/T80890#1741242 (Dzahn)
[06:30:09] operations, Icinga, Mail: fix/streamline mail routing off of neon - https://phabricator.wikimedia.org/T80890#880945 (Dzahn)
[06:30:44] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:54] PROBLEM - puppet last run on lvs1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:13] PROBLEM - puppet last run on eventlog2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:23] operations, Icinga, Mail: fix/streamline mail routing off of neon - https://phabricator.wikimedia.org/T80890#880945 (Dzahn) on neon: 53 transport = remote_smtp 54 route_list = * mx1001.wikimedia.org:mx2001.wikimedia.org
[06:31:32] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:43] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds
[06:31:53] PROBLEM - puppet last run on db2056 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:03] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:13] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:33] PROBLEM - puppet last run on mw2045 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:52] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:37:03] operations, Discovery: Fix CirrusSearch monitoring - https://phabricator.wikimedia.org/T84163#1741251 (Dzahn) >>! In T84163#1647247, @Dzahn wrote: > I scheduled a downtime for this service of 1 month with a link to this ticket. Since that's over it's back as a WARNING for now: https://icinga.wikimedia.or...
[06:38:43] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212
[06:39:52] RECOVERY - puppet last run on cp3005 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:41:04] operations, Beta-Cluster-Infrastructure, Shinken: Make the Shinken IRC alert bot use colors - https://phabricator.wikimedia.org/T113785#1741252 (Dzahn) p:Normal>Low
[06:43:34] operations, Beta-Cluster-Infrastructure, Shinken: Make the Shinken IRC alert bot use colors - https://phabricator.wikimedia.org/T113785#1741255 (Dzahn) colors for icinga-wm as well. wikibugs has them. so yay
[06:56:53] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:56:54] RECOVERY - puppet last run on lvs1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:06] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[07:08:16] (CR) Muehlenhoff: [C: 1] puppet: do not 'ensure latest', do require_package [puppet] - https://gerrit.wikimedia.org/r/247007 (https://phabricator.wikimedia.org/T115348) (owner: Dzahn)
[07:14:38] operations, ContentTranslation-Deployments, ContentTranslation-cxserver, MediaWiki-extensions-ContentTranslation, and 7 others: Standardise CXServer deployment - https://phabricator.wikimedia.org/T101272#1741298 (Amire80)
[07:26:53] RECOVERY - puppet last run on eventlog2001 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures
[07:27:03] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:27:23] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:27:32] RECOVERY - puppet last run on mw1086 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:27:32] RECOVERY - puppet last run on db2056 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[07:27:43] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[07:27:52] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:28:12] RECOVERY - puppet last run on mw2045 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:36:04] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Oct 21 07:36:04 UTC 2015 (duration 36m 3s)
[07:36:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:41:41] ok, let's fix that replication issue
[08:04:43] PROBLEM - mailman_queue_size on fermium is CRITICAL: CRITICAL: 1 mailman queue(s) above limits (thresholds: bounces: 25 in: 25 out: 300 virgin: 25)
[08:16:13] (CR) TTO: [C: 1] "LGTM" [mediawiki-config] - https://gerrit.wikimedia.org/r/247088 (https://phabricator.wikimedia.org/T114458) (owner: Luke081515)
[08:20:12] RECOVERY - mailman_queue_size on fermium is OK: OK: mailman queues are below the limits.
[08:32:48] !log restbase deploying 3006b77e
[08:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:40:25] !log restbase deployment of 3006b77e {{done}}
[08:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[08:51:13] PROBLEM - puppet last run on mw1083 is CRITICAL: CRITICAL: Puppet last ran 6 hours ago
[09:01:53] (PS2) Muehlenhoff: potassium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247220
[09:04:43] (PS1) Alexandros Kosiaris: etherpad: align X-Real-IP logging configuration [puppet] - https://gerrit.wikimedia.org/r/247791
[09:07:52] (CR) Muehlenhoff: [C: 2 V: 2] potassium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247220 (owner: Muehlenhoff)
[09:09:51] (PS2) Muehlenhoff: graphite1001: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247193
[09:11:04] (CR) Muehlenhoff: [C: 2 V: 2] graphite1001: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247193 (owner: Muehlenhoff)
[09:12:55] (PS2) Muehlenhoff: graphite2001: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247194
[09:15:00] operations, Database: TokuDB crashes frequently -consider upgrade it or search for alternative engines with similar features - https://phabricator.wikimedia.org/T109069#1741669 (jcrespo) p:Normal>High Last issue that broke replication on dbstore1001: ``` SELECT * FROM revision WHERE rev_page=9023902...
[09:17:01] (CR) Muehlenhoff: [C: 2 V: 2] graphite2001: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247194 (owner: Muehlenhoff)
[09:20:02] (PS2) Alexandros Kosiaris: etherpad: align X-Real-IP logging configuration [puppet] - https://gerrit.wikimedia.org/r/247791
[09:20:35] (CR) Alexandros Kosiaris: [C: 2 V: 2] etherpad: align X-Real-IP logging configuration [puppet] - https://gerrit.wikimedia.org/r/247791 (owner: Alexandros Kosiaris)
[09:21:05] (PS2) Muehlenhoff: conf*: Convert to fully use the role keyword [puppet] - https://gerrit.wikimedia.org/r/246967
[09:22:00] (CR) Muehlenhoff: [C: 2 V: 2] conf*: Convert to fully use the role keyword [puppet] - https://gerrit.wikimedia.org/r/246967 (owner: Muehlenhoff)
[09:23:01] (PS2) Muehlenhoff: hafnium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247195
[09:23:52] (PS2) Jcrespo: Enabling performance schema experimentally on db1018 [puppet] - https://gerrit.wikimedia.org/r/247615 (https://phabricator.wikimedia.org/T99485)
[09:24:35] (CR) Jcrespo: [C: 2] Enabling performance schema experimentally on db1018 [puppet] - https://gerrit.wikimedia.org/r/247615 (https://phabricator.wikimedia.org/T99485) (owner: Jcrespo)
[09:25:33] (PS3) Muehlenhoff: hafnium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247195
[09:25:42] (CR) Muehlenhoff: [C: 2 V: 2] hafnium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247195 (owner: Muehlenhoff)
[09:27:04] (PS2) Muehlenhoff: pc100*: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247218
[09:28:12] (PS1) Jcrespo: Revert "Enabling performance schema experimentally on db1018" [puppet] - https://gerrit.wikimedia.org/r/247792
[09:28:20] (PS2) Jcrespo: Revert "Enabling performance schema experimentally on db1018" [puppet] - https://gerrit.wikimedia.org/r/247792
[09:29:11] (CR) Jcrespo: [C: 2] Revert "Enabling performance schema experimentally on db1018" [puppet] - https://gerrit.wikimedia.org/r/247792 (owner: Jcrespo)
[09:29:20] (CR) Muehlenhoff: [C: 2 V: 2] pc100*: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247218 (owner: Muehlenhoff)
[09:29:32] (PS3) Muehlenhoff: pc100*: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247218
[09:29:46] (CR) Muehlenhoff: [C: 2 V: 2] pc100*: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247218 (owner: Muehlenhoff)
[09:30:13] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: puppet fail
[09:30:22] PROBLEM - puppet last run on db2044 is CRITICAL: CRITICAL: puppet fail
[09:30:52] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: puppet fail
[09:31:31] I think I missed the mariadb module commit?
[09:31:34] PROBLEM - puppet last run on db1072 is CRITICAL: CRITICAL: puppet fail
[09:32:02] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[09:32:03] PROBLEM - puppet last run on db1039 is CRITICAL: CRITICAL: puppet fail
[09:32:10] I do not know why it is on a separate repository
[09:32:20] (CR) Muehlenhoff: gadolinium: Use the role keyword (1 comment) [puppet] - https://gerrit.wikimedia.org/r/247191 (owner: Muehlenhoff)
[09:32:24] PROBLEM - puppet last run on db1060 is CRITICAL: CRITICAL: puppet fail
[09:32:27] (PS2) Muehlenhoff: gadolinium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247191
[09:32:44] PROBLEM - puppet last run on db2016 is CRITICAL: CRITICAL: puppet fail
[09:33:22] RECOVERY - puppet last run on db1072 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[09:33:23] PROBLEM - puppet last run on db1041 is CRITICAL: CRITICAL: puppet fail
[09:33:32] (PS9) Alexandros Kosiaris: etherpad: Move role into module [puppet] - https://gerrit.wikimedia.org/r/220085
[09:33:43] PROBLEM - puppet last run on db2049 is CRITICAL: CRITICAL: puppet fail
[09:34:13] PROBLEM - puppet last run on db2051 is CRITICAL: CRITICAL: puppet fail
[09:35:13] RECOVERY - puppet last run on db1041 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures
[09:36:03] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[09:36:15] (PS3) Muehlenhoff: gadolinium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247191
[09:36:36] (PS1) Aklapper: Remove auth.login-message - not supported by upstream anymore [puppet] - https://gerrit.wikimedia.org/r/247793 (https://phabricator.wikimedia.org/T116142)
[09:37:14] RECOVERY - puppet last run on db1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:37:34] RECOVERY - puppet last run on db1060 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:38:02] RECOVERY - puppet last run on db2016 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[09:38:53] (PS2) Muehlenhoff: terbium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247227
[09:38:54] RECOVERY - puppet last run on db2049 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[09:39:03] RECOVERY - puppet last run on db2044 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:39:24] RECOVERY - puppet last run on db2051 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[09:42:06] (CR) Muehlenhoff: [C: 2 V: 2] terbium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247227 (owner: Muehlenhoff)
[09:44:55] (PS2) Muehlenhoff: erbium: Move to using the role keyword [puppet] - https://gerrit.wikimedia.org/r/246971
[09:46:22] (PS2) Muehlenhoff: holmium: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247198
[09:46:42] operations, Beta-Cluster-Infrastructure, Shinken: Make the Shinken IRC alert bot use colors - https://phabricator.wikimedia.org/T113785#1741714 (hashar) Seems the notification commands are defined in puppet `modules/nagios_common/templates/notification_commands.cfg.erb` and simply append to a file that...
[09:51:50] a
[09:51:54] hello!
[09:51:58] (PS2) Muehlenhoff: labsdb100[1-3]: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247208
[09:52:23] I have a random salt question, is it possible to have the master keep track of minions and make sure a command is executed on all of them ?
[09:52:24] (CR) Muehlenhoff: [C: 2 V: 2] labsdb100[1-3]: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247208 (owner: Muehlenhoff)
[09:56:09] nothing built-in, you can match the list of expected hosts against the hosts which returned a result (since hosts may be powered down etc.)
[09:57:00] (PS4) Alexandros Kosiaris: maps: move hieradata from codfw to role/common [puppet] - https://gerrit.wikimedia.org/r/246992 (owner: Dzahn)
[09:57:49] (PS2) Alexandros Kosiaris: ldap.conf: Remove openldap unused parameters [puppet] - https://gerrit.wikimedia.org/r/246242
[09:57:56] (CR) Alexandros Kosiaris: [C: 2] ldap.conf: Remove openldap unused parameters [puppet] - https://gerrit.wikimedia.org/r/246242 (owner: Alexandros Kosiaris)
[09:58:00] moritzm: ok thanks :-}
[09:59:14] and I also found out: salt-run manage.status
[10:18:35] (PS5) Alexandros Kosiaris: maps: move hieradata from codfw to role/common [puppet] - https://gerrit.wikimedia.org/r/246992 (owner: Dzahn)
[10:18:41] (CR) Alexandros Kosiaris: [C: 2 V: 2] maps: move hieradata from codfw to role/common [puppet] - https://gerrit.wikimedia.org/r/246992 (owner: Dzahn)
[10:20:12] (PS2) Muehlenhoff: stat1001: Fully use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247222
[10:20:41] (CR) Muehlenhoff: [C: 2 V: 2] stat1001: Fully use the role keyword [puppet] - https://gerrit.wikimedia.org/r/247222 (owner: Muehlenhoff)
[10:26:05] (PS2) Muehlenhoff: Move dnsrecursor to the role keyword [puppet] - https://gerrit.wikimedia.org/r/246244
[10:31:35] (CR) Muehlenhoff: [C: 2 V: 2] Move dnsrecursor to the role keyword [puppet] - https://gerrit.wikimedia.org/r/246244 (owner: Muehlenhoff)
[10:33:25] (PS2) Muehlenhoff: db1069: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/246969
[10:34:07] (CR) Alexandros Kosiaris: "I understand the need to give permissions per service, in fact I support it fully, I am not sure what you mean by override however in the " [puppet] - https://gerrit.wikimedia.org/r/244814 (owner: John F. Lewis)
[10:34:59] (CR) Muehlenhoff: [C: 2 V: 2] db1069: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/246969 (owner: Muehlenhoff)
[10:35:31] (PS2) Alexandros Kosiaris: hiera_lookup: Use sub instead of tr [puppet] - https://gerrit.wikimedia.org/r/246987
[10:35:38] (CR) Alexandros Kosiaris: [C: 2 V: 2] hiera_lookup: Use sub instead of tr [puppet] - https://gerrit.wikimedia.org/r/246987 (owner: Alexandros Kosiaris)
[10:37:10] (PS2) Muehlenhoff: db1047: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/246968
[10:37:49] (CR) Muehlenhoff: [C: 2 V: 2] db1047: Use the role keyword [puppet] - https://gerrit.wikimedia.org/r/246968 (owner: Muehlenhoff)
[10:38:35] akosiaris: I'll puppet-merge your hiera_lookup change along?
[10:41:39] (PS2) Muehlenhoff: Use the role keyword for puppetmaster backends [puppet] - https://gerrit.wikimedia.org/r/247232
[10:41:45] moritzm: I was about to, ok thanks!
[10:43:20] ok, merged
[10:43:58] (CR) Alexandros Kosiaris: "almost 1 year old ? still applicable ? perhaps abandon ?"
[puppet] - 10https://gerrit.wikimedia.org/r/130296 (owner: 10ArielGlenn) [10:44:44] (03PS2) 10Alexandros Kosiaris: Move base::firewall include into the otrs role [puppet] - 10https://gerrit.wikimedia.org/r/245965 (owner: 10Muehlenhoff) [10:44:49] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Move base::firewall include into the otrs role [puppet] - 10https://gerrit.wikimedia.org/r/245965 (owner: 10Muehlenhoff) [10:52:29] (03PS2) 10Muehlenhoff: Use role keyword for dbstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/247230 [10:55:38] (03CR) 10Muehlenhoff: [C: 032 V: 032] Use role keyword for dbstore hosts [puppet] - 10https://gerrit.wikimedia.org/r/247230 (owner: 10Muehlenhoff) [11:01:02] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Looks good, -1 to block on https://gerrit.wikimedia.org/r/#/c/244436/ being merged first" [puppet] - 10https://gerrit.wikimedia.org/r/244884 (owner: 10Yurik) [11:01:32] (03PS2) 10Muehlenhoff: statistics::cruncher: Move standard and base::firewall includes into the role [puppet] - 10https://gerrit.wikimedia.org/r/247223 [11:01:48] (03CR) 10Muehlenhoff: [C: 032 V: 032] statistics::cruncher: Move standard and base::firewall includes into the role [puppet] - 10https://gerrit.wikimedia.org/r/247223 (owner: 10Muehlenhoff) [11:03:05] (03Abandoned) 10Muehlenhoff: Mark graphite1002 as testsystem [puppet] - 10https://gerrit.wikimedia.org/r/247234 (owner: 10Muehlenhoff) [11:08:41] (03PS3) 10Muehlenhoff: Use the role keyword for puppetmaster backends [puppet] - 10https://gerrit.wikimedia.org/r/247232 [11:10:56] (03CR) 10Muehlenhoff: [C: 032 V: 032] Use the role keyword for puppetmaster backends [puppet] - 10https://gerrit.wikimedia.org/r/247232 (owner: 10Muehlenhoff) [11:14:03] (03PS2) 10Muehlenhoff: Move the authdns servers to the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247235 [11:15:26] (03CR) 10Muehlenhoff: [C: 032 V: 032] Move the authdns servers to the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247235 (owner: 
10Muehlenhoff) [11:20:39] (03PS7) 10Alexandros Kosiaris: maps: Add tileratorui service [puppet] - 10https://gerrit.wikimedia.org/r/244436 (https://phabricator.wikimedia.org/T116062) [11:22:16] (03PS8) 10Alexandros Kosiaris: maps: Add tileratorui service [puppet] - 10https://gerrit.wikimedia.org/r/244436 (https://phabricator.wikimedia.org/T116062) [11:40:43] (03PS1) 10Muehlenhoff: Assign salt grains for authdns [puppet] - 10https://gerrit.wikimedia.org/r/247812 [11:40:46] (03PS1) 10Muehlenhoff: Assign salt grains for nova manager [puppet] - 10https://gerrit.wikimedia.org/r/247813 [11:40:48] (03PS1) 10Muehlenhoff: Assign salt grains for planet [puppet] - 10https://gerrit.wikimedia.org/r/247814 [11:40:50] (03PS1) 10Muehlenhoff: Assign salt grains for horizon [puppet] - 10https://gerrit.wikimedia.org/r/247815 [11:40:52] (03PS1) 10Muehlenhoff: Assign salt grains for stat servers [puppet] - 10https://gerrit.wikimedia.org/r/247816 [11:48:02] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for authdns [puppet] - 10https://gerrit.wikimedia.org/r/247812 (owner: 10Muehlenhoff) [11:51:37] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for nova manager [puppet] - 10https://gerrit.wikimedia.org/r/247813 (owner: 10Muehlenhoff) [12:02:23] (03PS1) 10KartikMistry: cxserver: Add JWT token support [puppet] - 10https://gerrit.wikimedia.org/r/247819 (https://phabricator.wikimedia.org/T116134) [12:06:28] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for planet [puppet] - 10https://gerrit.wikimedia.org/r/247814 (owner: 10Muehlenhoff) [12:08:14] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for horizon [puppet] - 10https://gerrit.wikimedia.org/r/247815 (owner: 10Muehlenhoff) [12:09:44] akosiaris: why https://gerrit.wikimedia.org/r/#/c/247819/ - says can't merge? 
[12:10:01] Can MergeNo [12:10:46] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for stat servers [puppet] - 10https://gerrit.wikimedia.org/r/247816 (owner: 10Muehlenhoff) [12:10:54] PROBLEM - puppet last run on elastic1025 is CRITICAL: CRITICAL: Puppet has 1 failures [12:11:15] kart_: needs to be rebased [12:11:23] ah. [12:11:45] (03PS2) 10KartikMistry: cxserver: Add JWT token support [puppet] - 10https://gerrit.wikimedia.org/r/247819 (https://phabricator.wikimedia.org/T116134) [12:12:06] I thought my clone was up-to-date. [12:16:30] (03CR) 10Alexandros Kosiaris: "@yurik, we will need a tileratorui/deploy repo, otherwise this LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/244436 (https://phabricator.wikimedia.org/T116062) (owner: 10Alexandros Kosiaris) [12:24:40] (03PS1) 10Alexandros Kosiaris: hiera: fix corner case in role backend [puppet] - 10https://gerrit.wikimedia.org/r/247822 [12:26:55] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] hiera: fix corner case in role backend [puppet] - 10https://gerrit.wikimedia.org/r/247822 (owner: 10Alexandros Kosiaris) [12:33:31] (03CR) 10Alexandros Kosiaris: cxserver: Add JWT token support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/247819 (https://phabricator.wikimedia.org/T116134) (owner: 10KartikMistry) [12:35:04] moritzm: around? [12:36:27] yep [12:36:58] Could you take a re-look over https://github.com/wikimedia/operations-puppet/commit/fba5006123579909681ca10fdf176d4c8ad4e2f2 please, mainly the end statement you changed at the end [12:37:04] RECOVERY - puppet last run on elastic1025 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [12:37:29] Seems labs has lost the ability for sudo coming from ldap and that's the latest change I can see for it, just a hunch [12:38:28] And the statement you changed to me, seems to be where the sudo rules are coming from and the change seems irregular to the rest of the statement/existing cases in the file. 
[12:39:06] JohnFLewis: that patch is by akosiaris, so he can probably comment better on it, but that seems likely the cause [12:39:59] (03PS1) 10Alexandros Kosiaris: diamond: enable ntpd collector across the fleet [puppet] - 10https://gerrit.wikimedia.org/r/247823 [12:40:33] moritzm: oh! I got it confused with you because I saw a bunch of your merges above and below it in git history :) [12:40:40] JohnFLewis: BINDDN and BINDPW are not really honored by openldap [12:40:45] so that should have been a noop [12:40:51] akosiaris: not that - the last line you changed [12:40:53] what is it that is not working ? [12:41:15] +<% if @ldapincludes.include?('sudo') then %>SUDOERS_BASE <%= @ldapconfig["sudobasedn"] %><% end -%> ? [12:41:26] that only trims the last newline [12:43:15] akosiaris: could you look at the config file in production and see (for labs) the output? If it just trims the last newline, then I can't see why it would cause it to not work [12:43:16] JohnFLewis: what's the problem ? [12:43:32] I just sudoed in a labs machine [12:43:41] akosiaris: https://phabricator.wikimedia.org/T116148#1741869 [12:44:06] ok lemme check [12:44:14] It does seem an LDAP issue :) [12:44:15] Thanks [12:45:52] hmm seems like my sudoing is successful cause I am ops [12:46:25] Which would not be through project ldap (or LDAP at all presumably) [12:46:50] actually it is [12:47:11] I don't see it in /etc/group [12:47:20] and it is indeed via LDAP. the cn=ops group [12:47:24] The change only affects per project it seems to me though [12:47:58] ah yes indeed [12:48:11] Which sort of makes it hard for you to test if you're inheriting it via ops LDAP [12:52:46] ok, I'll revert that line just to be sure, since it really is difficult to check that one [12:53:12] wasn't that 7 days ago?
[12:53:26] jynus: no, merged 2 hours ago [12:53:32] jynus: merged today, written 7 days ago [12:53:37] ah, ok, then makes sense [12:54:15] sorry, got accustomed to gerrit/git.wm, not to github any more [12:54:44] jynus: how can you be accustomed to git.wm.o, it's unstable :) [12:54:52] well... [12:55:02] I should have said phabricator [12:55:12] :) [12:55:14] but that is equally unstable at times :-) [12:55:28] probably because the DBA doesn't do his job [12:55:31] akosiaris: poke me when the change has been applied and I'll test it on my end [12:55:54] jynus: not your fault if the software creates like 10 million connections for a single search ;) [12:56:15] (03PS1) 10Alexandros Kosiaris: ldap: partially revert fba5006 [puppet] - 10https://gerrit.wikimedia.org/r/247826 (https://phabricator.wikimedia.org/T116148) [12:56:34] JohnFLewis, "Fool me once shame on you, ..." [12:56:39] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] ldap: partially revert fba5006 [puppet] - 10https://gerrit.wikimedia.org/r/247826 (https://phabricator.wikimedia.org/T116148) (owner: 10Alexandros Kosiaris) [12:56:50] jynus: ah, okay :) [12:57:08] (03CR) 10Nikerabbit: cxserver: Add JWT token support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/247819 (https://phabricator.wikimedia.org/T116134) (owner: 10KartikMistry) [12:57:11] though wouldn't it be 'fool me 10 million times, shame on you, fool me 20 million times...'
[13:00:50] (03PS1) 10Muehlenhoff: Drop salt grains from initial testing [puppet] - 10https://gerrit.wikimedia.org/r/247827 [13:00:52] (03PS1) 10Muehlenhoff: Assign salt grains for kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/247828 [13:00:54] (03PS1) 10Muehlenhoff: Assign salt grains for grafana [puppet] - 10https://gerrit.wikimedia.org/r/247829 [13:00:56] (03PS1) 10Muehlenhoff: Assign salt grains for elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/247830 [13:03:29] the other option could be that the issue has always been there and the previous commit has only "awakened it" [13:04:10] (03CR) 10Alexandros Kosiaris: cxserver: Add JWT token support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/247819 (https://phabricator.wikimedia.org/T116134) (owner: 10KartikMistry) [13:09:17] (03CR) 10Ottomata: [C: 031] eventlog1001: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/246979 (owner: 10Muehlenhoff) [13:12:57] (03CR) 10Yurik: "Alex, can you make it re-use the same tilerator/deploy repo? The code should be identical, and it would make no sense to constantly push t" [puppet] - 10https://gerrit.wikimedia.org/r/244436 (https://phabricator.wikimedia.org/T116062) (owner: 10Alexandros Kosiaris) [13:13:11] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1742098 (10Ottomata) Since this explicitly asks for stat1003 and stat1002, atgo will also need to be in the `statistics-users` group. That...
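Note on the salt question from 09:52 (can the master make sure a command ran on every minion?): the answer given at 09:56 was that nothing is built in, and that one can match the list of expected hosts against the hosts which returned a result. A minimal Python sketch of that comparison follows. It assumes the command results have already been collected into a dict keyed by minion id (for example by parsing JSON output from the salt CLI); the host names and the helper names `missing_minions` and `unexpected_minions` are hypothetical and not part of salt.

```python
def missing_minions(expected_hosts, results):
    """Return the expected hosts that did not report a result, sorted.

    expected_hosts: iterable of minion ids we wanted to reach.
    results: dict mapping minion id -> whatever the command returned.
    """
    return sorted(set(expected_hosts) - set(results))


def unexpected_minions(expected_hosts, results):
    """Return hosts that answered but were not on the expected list, sorted."""
    return sorted(set(results) - set(expected_hosts))


if __name__ == "__main__":
    # Hypothetical data: host-b is powered down and never answered.
    expected = ["host-a", "host-b", "host-c"]
    returned = {"host-a": True, "host-c": True}
    print("no result from:", missing_minions(expected, returned))
    print("not expected:", unexpected_minions(expected, returned))
```

The hosts reported as missing are the ones to retry or investigate; `salt-run manage.status`, mentioned at 09:59, gives a separate view of which minions are up or down.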
[13:14:41] (03CR) 10Ottomata: [C: 031] gadolinium: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247191 (owner: 10Muehlenhoff) [13:24:11] (03PS1) 10Alexandros Kosiaris: ldap: revert the rest of fba5006 [puppet] - 10https://gerrit.wikimedia.org/r/247834 [13:24:43] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] ldap: revert the rest of fba5006 [puppet] - 10https://gerrit.wikimedia.org/r/247834 (owner: 10Alexandros Kosiaris) [13:28:50] (03PS6) 10BBlack: Mark incoming requests without cookies in x-analytics [puppet] - 10https://gerrit.wikimedia.org/r/244626 (https://phabricator.wikimedia.org/T114370) (owner: 10Nuria) [13:29:00] (03CR) 10BBlack: [C: 032 V: 032] Mark incoming requests without cookies in x-analytics [puppet] - 10https://gerrit.wikimedia.org/r/244626 (https://phabricator.wikimedia.org/T114370) (owner: 10Nuria) [13:29:52] akosiaris: yours ok to merge? mine is [13:30:03] (03CR) 10Ottomata: "1 nit, otherwise LGTM" (031 comment) [puppet/cdh] - 10https://gerrit.wikimedia.org/r/247758 (owner: 10Nuria) [13:30:41] bblack: yup [13:30:42] thnaks [13:30:44] thanks* [13:30:46] (03CR) 10Ottomata: "Actually, on 2nd thought, should this be configurable? Is it something all users of the puppet-cdh module will want to have turned on by " [puppet/cdh] - 10https://gerrit.wikimedia.org/r/247758 (owner: 10Nuria) [13:31:35] (03CR) 10Ottomata: [C: 031] erbium: Move to using the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/246971 (owner: 10Muehlenhoff) [13:32:09] 6operations, 5Continuous-Integration-Scaling: install/deploy scandium as zuul merger (ci) server - https://phabricator.wikimedia.org/T95046#1742165 (10hashar) Note the zuul-merger process on scandium will need to be able to reach the Gearman server on gallium (production). 
[13:35:29] !log stopping and fixing replication on labsdb1004 (not in production) [13:35:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:35:37] (03PS1) 10Aude: Enable GeoData on beta wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247835 [13:35:39] (03PS1) 10Aude: Enable GeoData on test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247836 [13:35:41] (03PS1) 10Aude: Enable geosearch on test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247837 [13:36:12] 6operations: FeaturedFeedsWMF.php should be available via noc.wikimedia.org - https://phabricator.wikimedia.org/T116163#1742177 (10saper) 3NEW [13:40:33] (03Abandoned) 10Muehlenhoff: Enable ferm on hadoop master [puppet] - 10https://gerrit.wikimedia.org/r/237099 (owner: 10Muehlenhoff) [13:40:50] (03CR) 10Andrew Bogott: [C: 031] holmium: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247198 (owner: 10Muehlenhoff) [13:41:24] aude, do you know that is the state of geodata on wikicommons: I see now "Also will need to enable GeoData on Wikidata, but separate step" ? [13:41:39] (03PS3) 10KartikMistry: cxserver: Add JWT token support [puppet] - 10https://gerrit.wikimedia.org/r/247819 (https://phabricator.wikimedia.org/T116134) [13:43:34] sorry, wikicommons is nothing, I meant wikidata [13:45:13] jynus: i would like to enable geodata for wikidata [13:45:29] (first for beta + test.wikidata, though) [13:45:30] I see that the table has been already created [13:45:35] yeah [13:45:45] but with the old definition [13:45:48] oh [13:46:11] which actually makes what I want to do not-an-issue [13:46:11] * aude looks [13:46:16] 6operations, 10OTRS: Upgrade OTRS to latest stable release (5.0 or later) - https://phabricator.wikimedia.org/T74109#1742240 (10Steinsplitter) [13:46:24] what do you want to do? 
[13:46:26] 6operations, 10OTRS: Upgrade OTRS to latest stable release (5.0 or later) - https://phabricator.wikimedia.org/T74109#771227 (10Steinsplitter) https://www.otrs.com/release-notes-otrs-5/ [13:46:28] same change on all wikis, including wikidata [13:46:32] ok [13:46:38] apply https://gerrit.wikimedia.org/r/#/c/180704/6 [13:46:50] I think that will not affect you [13:46:53] great [13:47:00] when were you planning to do that? [13:47:07] now :-) [13:47:11] ok :D [13:47:40] and with now I mean during the next 24 hours [13:47:52] but it doesn't block us [13:48:09] * aude not planning to populate geodata yet [13:48:20] I think not, I was asking to clarify the status of wikidata [13:48:24] k [13:48:46] if no one is deploying, i'd like to proceed some :) [13:49:05] I will test on testwiki first, though (unrelated to code deployment) [13:49:14] k [13:51:49] 10Ops-Access-Requests, 6operations: Grant Joseph and Dan deploy permissions on aqs100[1-3] - https://phabricator.wikimedia.org/T116169#1742271 (10Milimetric) 3NEW [13:52:09] bblack, hi, we are getting closer to maps db update, and will need a varnish invalidation ip soonish. How much work is it to create it? 
[13:53:07] (03PS2) 10Muehlenhoff: eventlog1001: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/246979 [13:53:27] (03CR) 10Muehlenhoff: [C: 032 V: 032] eventlog1001: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/246979 (owner: 10Muehlenhoff) [13:54:41] (03CR) 10Aude: [C: 032] Enable GeoData on beta wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247835 (owner: 10Aude) [13:54:48] (03Merged) 10jenkins-bot: Enable GeoData on beta wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247835 (owner: 10Aude) [13:55:49] (03PS3) 10Muehlenhoff: erbium: Move to using the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/246971 [13:55:58] (03CR) 10Muehlenhoff: [C: 032 V: 032] erbium: Move to using the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/246971 (owner: 10Muehlenhoff) [13:56:16] !log performing schema change on testwiki.geo_tags [13:56:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:57:13] (03PS4) 10Muehlenhoff: gadolinium: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247191 [13:57:49] ^there was a 1 second-spike lag there, not good [13:57:58] :( [13:58:20] do not worry- we had much worse [13:58:27] yeah [13:58:37] geo_tags on enwiki etc. 
is a bit bigger [13:58:40] that is why I test it first :-) [13:58:53] would definitely be nice to have the schema change for wikidata first [13:59:03] before it is populated too much [13:59:12] I can do that [13:59:16] k [13:59:31] on an empty table, it should be unproblematic :) [13:59:48] all changes I do are 99.999% online [13:59:57] * aude nods [14:00:03] (03CR) 10Muehlenhoff: [C: 032 V: 032] gadolinium: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/247191 (owner: 10Muehlenhoff) [14:00:16] it is only a question of tuning speed vs performance [14:00:43] sometimes a 1 second lag would be preferred to 1 hour with poor performance [14:01:05] it is always evaluating risks- but I think you do not need to be told about that :-) [14:01:12] yeah [14:02:00] let me do the wikidatatest first, so they have the same structure [14:02:06] ok [14:02:21] * aude has geodata on beta wikidata now [14:03:04] aren't beta schema changes applied automatically? [14:03:15] they are [14:03:23] so that should be good already [14:03:36] but not sure about cirrus mapping changes [14:03:55] wasn't there a wikidatatestwiki or something, I never remember the actual name [14:04:21] testwikidatawiki [14:04:28] thank you [14:04:38] aude: did you run any maint script from cirrus? [14:04:49] you will agree with me that the name is not precisely trivial :-) [14:06:38] dcausse: for beta, i think it's updateSearchIndexConfig.php (but want to check) [14:06:54] i think the steps are enable geodata but geosearch disabled [14:07:00] update the config (not sure?)
[14:07:05] then enable geosearch [14:07:32] aude: yes updateSearchIndexConfig.php should update the mapping, it should fail if the change is incompatible [14:07:37] ok [14:07:51] * aude is reasonably confident with how all this works :) [14:07:56] but still like to ask [14:08:21] !log applying schema change to testwikidatawiki (s3) and wikidatawiki(s5) [14:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:08:27] if it is not too demanding etc., we would like to do refreshlinks soon to update geotags and page images [14:08:37] not sure if that also results in search updates (and is safe?) [14:08:46] or we need to force update the search [14:09:22] dcausse: http://wikidata.beta.wmflabs.org/wiki/Q27946?action=cirrusdump looks good [14:09:56] (@todo to add support for type, dim and the other things, but this is ok to start) [14:10:03] ok, that should be it [14:10:14] thanks jynus [14:10:17] I will continue tomorrow to let people discover potential issues [14:10:17] aude: I would check the mapping from es, what's the elastic cluster? [14:10:33] dcausse: looking [14:10:34] before applying to the rest of the databases [14:10:45] jynus: k [14:11:31] aude: is it deployment-prep elastic? [14:11:56] dcausse: i see ddeployment-elastic05.deployment-prep.eqiad.wmflabs (and *06, *07, *08) [14:12:03] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/). 
[14:12:05] ok will check [14:12:07] deployment-elastic05.deployment-prep.eqiad.wmflabs [14:14:55] aude: looks good, I can see coordinate there (curl -s deployment-elastic05.deployment-prep.eqiad.wmflabs:9200/wikidatawiki_content_first/_mapping?pretty) [14:15:14] curl 'deployment-elastic05.deployment-prep.eqiad.wmflabs:9200/wikidatawiki_content_first/_mapping' looks good [14:15:23] yeah, also just checked :) [14:15:55] so if refreshLinks schedule cirrus updates data should be correctly indexed [14:15:58] (03CR) 10Muehlenhoff: [C: 04-1] "The role::labsdnsrecursor class isn't properly sourced, see the output from puppet compiler:" [puppet] - 10https://gerrit.wikimedia.org/r/247198 (owner: 10Muehlenhoff) [14:16:26] we should maybe schedule an optimize (at least check the number of deleted docs) [14:16:32] ok [14:17:15] (03PS2) 10Muehlenhoff: Drop salt grains from initial testing [puppet] - 10https://gerrit.wikimedia.org/r/247827 [14:17:34] (03CR) 10Muehlenhoff: [C: 032 V: 032] Drop salt grains from initial testing [puppet] - 10https://gerrit.wikimedia.org/r/247827 (owner: 10Muehlenhoff) [14:18:56] * aude trying refresh links locally again [14:19:14] after recreating my indices w/o coordinates [14:19:23] (03PS1) 10Alexandros Kosiaris: ldap: group sudo-ldap settings and comment them [puppet] - 10https://gerrit.wikimedia.org/r/247838 [14:20:13] (03PS2) 10Muehlenhoff: Assign salt grains for kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/247828 [14:22:10] moritzm: , sure! why just kafka1012? 
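Note on the mapping check at 14:14:55: the `_mapping` output was inspected by eye to confirm that the coordinate field is present. A rough Python sketch of doing the same check programmatically is below. It assumes the JSON returned by the `_mapping` endpoint has already been loaded into a dict; the sample mapping and the field names in it are hypothetical and are not the real CirrusSearch mapping, and `find_field_paths` is an illustrative helper.

```python
def find_field_paths(mapping, field_name):
    """Collect dotted paths of every mapped property called field_name.

    Walks the nested dict and descends through each "properties" block,
    which is where Elasticsearch mappings list their fields.
    """
    paths = []

    def walk(node, path):
        if not isinstance(node, dict):
            return
        for key, sub in node.items():
            if key == "properties" and isinstance(sub, dict):
                for name, definition in sub.items():
                    child = path + [name]
                    if name == field_name:
                        paths.append(".".join(child))
                    walk(definition, child)
            else:
                # Index name, "mappings", type name etc.: keep the same path.
                walk(sub, path)

    walk(mapping, [])
    return sorted(paths)


if __name__ == "__main__":
    # Hypothetical, heavily trimmed mapping for illustration only.
    sample = {
        "wikidatawiki_content_first": {
            "mappings": {
                "page": {
                    "properties": {
                        "title": {"type": "string"},
                        "coordinates": {
                            "type": "nested",
                            "properties": {"coord": {"type": "geo_point"}},
                        },
                    }
                }
            }
        }
    }
    print(find_field_paths(sample, "coordinates"))
```

An empty result would mean the mapping update has not been applied to that index yet.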
[14:22:56] ottomata: 1012 serves as the canary, the others are matched by the role [14:23:09] (unless some other broker would be a more suitable canary) [14:23:12] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] ldap: group sudo-ldap settings and comment them [puppet] - 10https://gerrit.wikimedia.org/r/247838 (owner: 10Alexandros Kosiaris) [14:23:17] (03PS2) 10Alexandros Kosiaris: ldap: group sudo-ldap settings and comment them [puppet] - 10https://gerrit.wikimedia.org/r/247838 [14:23:22] (03CR) 10Alexandros Kosiaris: [V: 032] ldap: group sudo-ldap settings and comment them [puppet] - 10https://gerrit.wikimedia.org/r/247838 (owner: 10Alexandros Kosiaris) [14:24:35] naw they are all the same [14:25:42] ok [14:30:37] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/247828 (owner: 10Muehlenhoff) [14:30:45] (03PS3) 10Muehlenhoff: Assign salt grains for kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/247828 [14:30:56] (03CR) 10Muehlenhoff: [V: 032] Assign salt grains for kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/247828 (owner: 10Muehlenhoff) [14:32:12] jouncebot: next [14:32:12] In 0 hour(s) and 27 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151021T1500) [14:32:30] (03CR) 10Aude: [C: 032] Enable GeoData on test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247836 (owner: 10Aude) [14:32:35] (03PS1) 10EBernhardson: Revert "Revert "Revert "Revert "Enable config for all three search clusters, but only write to eqiad"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247845 [14:32:37] (03Merged) 10jenkins-bot: Enable GeoData on test.wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247836 (owner: 10Aude) [14:34:12] (03PS2) 10EBernhardson: Revert "Revert "Revert "Revert "Enable config for all three search clusters, but only write to eqiad"""" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/247845 [14:34:30] !log aude@tin Synchronized wmf-config/InitialiseSettings-labs.php: enabled geodata on beta wikidata (duration: 00m 18s) [14:34:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:34:45] (03PS2) 10Muehlenhoff: Assign salt grains for grafana [puppet] - 10https://gerrit.wikimedia.org/r/247829 [14:35:12] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [14:35:28] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for grafana [puppet] - 10https://gerrit.wikimedia.org/r/247829 (owner: 10Muehlenhoff) [14:35:33] had failure copying from mw1097 to mw1083 [14:36:13] * aude wonders if those are undergoing maintenance or have issues? [14:40:27] (03PS2) 10Muehlenhoff: Assign salt grains for elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/247830 [14:43:10] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grains for elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/247830 (owner: 10Muehlenhoff) [14:44:06] dcausse: appears that refreshlinks doesn't affect search update [14:44:14] :/ [14:44:17] (unless i was doing something wrong on my wiki) [14:44:33] aude: do you another entity which should have coord? [14:44:41] aude:* do you have another entity which should have coord? 
[14:44:55] i'm trying on my local wiki [14:45:05] (03PS2) 10Dzahn: archiva: use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/246959 (owner: 10Muehlenhoff) [14:45:30] it's strange (unless you updated it manually) but Q27946 has coord inside ES [14:45:38] that's beta [14:45:43] curl -s 'deployment-elastic05.deployment-prep.eqiad.wmflabs:9200/wikidatawiki_content_first/page/_search?q=title.plain:Q27946&pretty' [14:45:44] i added the coordinates just now [14:45:50] so it's normal search update job [14:46:00] (03CR) 10Dzahn: [C: 032] archiva: use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/246959 (owner: 10Muehlenhoff) [14:46:01] ok [14:46:09] * aude will have to go find some existing items on beta that should have coordinates [14:46:12] and then try refresh links [14:47:06] meanwhile, i have some issues with mw1083 [14:47:12] sync-common also fails [14:47:16] if refreshLinks does not schedule cirrus updates we'll have to do a full rebuild, according to Nik it's around 3-5 days for enwiki. 
I have no idea for wikidata :/ [14:47:45] (03PS1) 10Alexandros Kosiaris: service::node support overriding the repository [puppet] - 10https://gerrit.wikimedia.org/r/247846 [14:47:49] (03CR) 10Dzahn: [C: 04-1] "uh oh, but why does this happen here and we have no issues in all these other cases" [puppet] - 10https://gerrit.wikimedia.org/r/247198 (owner: 10Muehlenhoff) [14:47:50] :( [14:47:58] maybe we cna fix this [14:48:10] can take a look [14:48:18] ok [14:48:45] mutante: ‘uh oh’ is about as far as my thinking went as well :) [14:49:31] !log aude@tin Synchronized wmf-config/InitialiseSettings.php: Enable GeoData on test.wikidata (duration: 00m 18s) [14:49:36] * aude proceeds with geodata on test.wikidata but then should stop and figure out what's wrong with mw1083 [14:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:50:30] (03CR) 10Dzahn: "poolcounters should be last and double checked, can break a lot" [puppet] - 10https://gerrit.wikimedia.org/r/247225 (owner: 10Muehlenhoff) [14:51:56] (03CR) 10Dzahn: "looks good, but since the issue on holmium, we should proof it in compiler" [puppet] - 10https://gerrit.wikimedia.org/r/247217 (owner: 10Muehlenhoff) [14:57:05] (03PS2) 10Dzahn: fluorine: Use the role keyword [puppet] - 10https://gerrit.wikimedia.org/r/246980 (owner: 10Muehlenhoff) [14:57:18] (03CR) 10Dzahn: [C: 032] "http://puppet-compiler.wmflabs.org/1049/" [puppet] - 10https://gerrit.wikimedia.org/r/246980 (owner: 10Muehlenhoff) [14:58:07] 6operations, 7Icinga: ganeti: PROCS CRITICAL: 2 processes ... - https://phabricator.wikimedia.org/T116111#1742495 (10akosiaris) a:3akosiaris Hmm, so this never really triggers an alert because after a single recheck it is back to normal. Need to figure out what triggers the second process. It probably is not... [14:59:12] 6operations, 7Icinga: ganeti: PROCS CRITICAL: 2 processes ... 
- https://phabricator.wikimedia.org/T116111#1742497 (10Dzahn) eh, fair, i just happened to see it in web ui. it was probably always just a SOFT state and did not trigger a notification. yes [15:00:01] !log rsync failed on mw1083: "failed to set times on "/srv/mediawiki/wmf-config": Read-only file system" [15:00:04] anomie ostriches thcipriani marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151021T1500). Please do the needful. [15:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:00:31] i am not sure that is something i can fix [15:00:40] (03CR) 10Alex Monk: fluorine: Use the role keyword (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/246980 (owner: 10Muehlenhoff) [15:00:41] but it repeatedly fails [15:02:46] i'm swatting things way too often, recently. [15:02:47] hi. [15:03:07] Can anyone direct me to a list of all statsd hosts? [15:03:25] MatmaRex: SWAT yourself a break :) [15:03:30] dcausse: for testwikidatawiki, it reports an index difference and says we need to reindex/remove [15:03:49] also, why did jouncebot not ping me? it used to ping people. [15:04:02] oh, unless there really is just one :P [15:04:03] aude: where is this index, and why is it different? [15:04:09] testwikidatawiki [15:04:21] on deployment-elastic? [15:04:24] * aude is on tin (though done with deploymetns for now) [15:04:45] i am not sure how to debug this [15:04:53] https://test.wikidata.org/w/api.php?action=cirrus-mapping-dump [15:05:09] so if you're running scripts from tin it should be on the eqiad cluster [15:05:13] addshore: clarify? :) [15:05:38] hm [15:06:23] aude can you paste me the maint command you're running? 
[15:06:30] (03CR) 10Alexandros Kosiaris: [C: 04-1] cxserver: Add JWT token support (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/247819 (https://phabricator.wikimedia.org/T116134) (owner: 10KartikMistry) [15:07:06] dcausse: mwscript extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php --wiki testwikidatawiki [15:07:07] dcausse: actually everything is back on the 'old config', while debugging an issue with job queue it was reverted yesterday (but turned out to not be the culprit) [15:07:16] this was ok on beta [15:07:18] but re-deploying the multi-cluster config in swat again [15:07:28] ebernhardson: ok [15:07:38] so i should wait :) [15:07:41] (03CR) 10Yurik: [C: 031] service::node support overriding the repository [puppet] - 10https://gerrit.wikimedia.org/r/247846 (owner: 10Alexandros Kosiaris) [15:07:53] maybe :) [15:08:09] hasharMeeting: thanks a ton for the help with the perf regression task [15:08:10] not seeing anyone jump at SWAT, i suppose i can deploy [15:08:19] JohnFLewis: well, https://wikitech.wikimedia.org/wiki/Statsd actually says simply use statsd.eqiad.wmnet, doesnt look like I can access that from this box though [15:08:30] * aude is done for the moment, but would like to do more after swat [15:09:29] addshore: ah; statsd.eqiad.wmnet is graphite1001. /me looks [15:09:32] (03PS1) 10Luke081515: Add three new groups to ruwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247850 (https://phabricator.wikimedia.org/T116143) [15:10:13] yeh, I expect it's just fire walled from where I was looking ;) [15:10:37] aude: do you still have the maint script ouput? (maybe it's not related to mapping) [15:10:42] ebernhardson: could you back out the wikimediaevents change? [15:10:47] dcausse: yep [15:10:52] ori: yea we can try [15:10:59] you shouldn't use the jstorage library at all, it is a piece of crap [15:11:12] what do use for localStorage? 
this is the one i was told to use [15:11:28] (based on some ticket complaining about having 4 different localStorage wrappers in various mediawiki extensions + core) [15:11:42] just use localStorage [15:11:56] then i have to write a library to handle TTL's :S but ok [15:12:02] dcausse: https://phabricator.wikimedia.org/P2213 [15:12:05] or mw.storage if you don't want your own try / catch [15:13:04] aude: it's the analyzers, we recently added --justMapping option that *could* allow us to work around this problem [15:13:15] hm [15:13:33] would that be safe to try? [15:13:34] but not sure why the analyzers have changed :/ [15:13:38] * aude can wait until after swat [15:13:54] aude: it's still in gerrit for review :( [15:14:00] oh [15:15:33] addshore: from looking at it quickly, what are you trying to achieve? firewall isn't the issue, it'll be permissions if that [15:16:06] well, my plan was to send things to statsd from stat1002 ;) [15:16:23] !log ebernhardson@tin Synchronized php-1.27.0-wmf.2/extensions/Cite/Cite_body.php: Do not double-parse error references duplicate key (duration: 00m 19s) [15:16:26] MatmaRex: ^ [15:16:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:17:06] dcausse: would https://wikitech.wikimedia.org/wiki/Search#In_place_reindex be what we need to do? or might there be a better / different way? 
[15:17:21] (03CR) 10Mobrovac: service::node support overriding the repository (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/247846 (owner: 10Alexandros Kosiaris) [15:17:58] ebernhardson: thanks, verified on en.wp [15:18:02] !log ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/Cite/Cite_body.php: Do not double-parse error references duplicate key (duration: 00m 17s) [15:18:05] aude: we can't because it will reindex the content by reading it from es, so if docs in es do not have the coord it's useless :( [15:18:06] MatmaRex: ^ :) [15:18:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:18:12] oh [15:18:37] ebernhardson: for wmf.3 i'd have to create some MediaWiki: namespace pages somewhere to test, so i won't bother. it works as it should on wmf.2 [15:18:42] thanks! [15:18:47] aude: there is a way to force a real reindex, that reads from the sql databases, but i don't think anyone on the team has done that, well ever [15:18:52] aude: but nik said it takes a long time :) [15:19:15] for test.wikidata, can't imagine a problem with that [15:19:30] but then if we have the same issue on wikidata, then.... [15:19:46] (03CR) 10EBernhardson: [C: 032] Revert "Revert "Revert "Revert "Enable config for all three search clusters, but only write to eqiad"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247845 (owner: 10EBernhardson) [15:20:09] (03PS4) 10KartikMistry: cxserver: Add JWT token support [puppet] - 10https://gerrit.wikimedia.org/r/247819 (https://phabricator.wikimedia.org/T116134) [15:20:12] (03Merged) 10jenkins-bot: Revert "Revert "Revert "Revert "Enable config for all three search clusters, but only write to eqiad"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247845 (owner: 10EBernhardson) [15:20:15] addshore: how are you sending data? 
looking at it; sending it to statsd.eqiad.wmnet:8125 should work from hosts [15:20:15] aude: I think we have 2 options: 1/ merge the --justMapping option to workaround the issue you had on tin and fix refreshLink _or_ 2/ a long full rebuild [15:20:16] for adding coordinates, i was thinking refresh links (but doesn't seem to work) for cirrus [15:20:34] heh.. that commit summary [15:20:53] if there was a way for cirrus to get the coordinates from geo_tags, thent hat would be nicer [15:21:31] ebernhardson: think that wins the award of 'most reverts in a patch to date' [15:21:45] JohnFLewis: i've got at least one more in my history with the same number ;) [15:22:03] ebernhardson: revert it and make it more ;) [15:23:15] JohnFLewis: well, "echo "test.add.foo:1|c" | nc -w 1 -u statsd.eqiad.wmnet 8125" [15:23:38] !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-production.php: CirrusSearch multi-datacenter configuration (duration: 00m 17s) [15:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:23:56] !log ebernhardson@tin Synchronized wmf-config/CirrusSearch-common.php: CirrusSearch multi-datacenter configuration (duration: 00m 17s) [15:24:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:24:14] !log ebernhardson@tin Synchronized wmf-config/CommonSettings.php: CirrusSearch multi-datacenter configuration (duration: 00m 17s) [15:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:24:32] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: CirrusSearch multi-datacenter configuration (duration: 00m 17s) [15:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:27:47] any thoughts JohnFLewis ? 
;) [15:28:32] addshore: honestly, nope :) I can't see anything firewall-y that would block it from a look though [15:28:36] dcausse: i am looking at https://gerrit.wikimedia.org/r/#/c/247788/ if that helps to get review and have someone try it (locally) [15:28:44] cool, ill try again later [15:28:56] at a glance, looks sane [15:29:01] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: touch and re-sync InitialiseSettings.php to bust cache (duration: 00m 17s) [15:29:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:29:16] aude: thanks! [15:29:22] is mw1070 de-pooled? [15:29:38] ebernhardson: don't know but had trouble with mw1083 [15:30:33] aude: oh your right, its mw1083. its syncing *from* 1070 [15:30:38] yeah [15:30:45] i tried sync-common there [15:31:15] addshore: looking at graphite; I see a test.foo which received data it seems today [15:31:34] (03PS2) 10Ori.livneh: ~ori: add `branchdir` script [puppet] - 10https://gerrit.wikimedia.org/r/247766 [15:31:47] (03CR) 10Ori.livneh: [C: 032 V: 032] ~ori: add `branchdir` script [puppet] - 10https://gerrit.wikimedia.org/r/247766 (owner: 10Ori.livneh) [15:32:53] ori: can you depool mw1083, its read-only filesystem is preventing updates [15:32:54] addshore: specifically https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1445441534.023&target=test.bash.foo.rate&target=test.bash.foo.lower [15:33:08] JohnFLewis: oooooooooohhh [15:33:17] it looks like it's you :) [15:34:12] JohnFLewis, lovely, so apparently there is quite a delay for the first reporting of the metric on the web interface though! [15:34:27] 6operations, 7Icinga, 7Mail: fix/streamline mail routing off of neon - https://phabricator.wikimedia.org/T80890#1742565 (10faidon) p:5Normal>3Low It's still a valid bug, however we now have multiple outgoing relays so it's less of a concern. 
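Note on the statsd test above: the one-liner addshore pasted (`echo "test.add.foo:1|c" | nc -w 1 -u statsd.eqiad.wmnet 8125`) sends a single counter datagram over UDP. A minimal Python sketch of the same thing follows, assuming the plain statsd counter line format `name:value|c`. The function names `format_counter` and `send_counter` are made up for illustration and are not part of any tool mentioned in the channel; the host and port defaults are simply the ones quoted in the conversation.

```python
import socket


def format_counter(metric, value=1):
    """Build a statsd counter datagram, e.g. b'test.add.foo:1|c'."""
    return "{}:{}|c".format(metric, value).encode("ascii")


def send_counter(metric, value=1, host="statsd.eqiad.wmnet", port=8125):
    """Send one counter increment over UDP and return the bytes sent.

    UDP is fire-and-forget: a successful sendto() does not prove that
    the statsd daemon received or accepted the datagram.
    """
    payload = format_counter(metric, value)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload
```

Since nothing comes back over UDP, the only confirmation is the metric showing up in graphite later, which matches the delay noticed above before the first data point appeared in the web interface.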
[15:34:49] 6operations, 7Icinga, 7Mail: fix/streamline mail routing off of neon - https://phabricator.wikimedia.org/T80890#1742567 (10faidon) a:5faidon>3None [15:35:45] 6operations, 7Icinga: ganeti: PROCS CRITICAL: 2 processes ... - https://phabricator.wikimedia.org/T116111#1742570 (10faidon) Pretty sure it's Ganeti's watcher running via cron. [15:36:17] (03CR) 10Alexandros Kosiaris: service::node support overriding the repository (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/247846 (owner: 10Alexandros Kosiaris) [15:37:23] PROBLEM - YARN NodeManager Node-State on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:38:21] !log depooled mw1083 [15:38:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:39:06] thank you akosiaris [15:39:12] RECOVERY - YARN NodeManager Node-State on analytics1038 is OK: OK: YARN NodeManager analytics1038.eqiad.wmnet:8041 Node-State: RUNNING [15:39:33] ori: due to a variety of changes to WME, reverting that patch is actually a whole string of reverts. How about if i just throw a 'return' at the top of the script to guarantee it doesnt run, would that still allow to verify perf impact? [15:40:25] i'm not sure; it depends where extra dependencies are expressed [15:40:41] ori: mw.loader.using [15:40:46] better to do a whole string of reverts. you can squash the reverts into a single commit if you're worried about gerrit change spam. [15:40:49] ok [15:42:34] 6operations, 10ops-eqiad: mw1083's sda disk is dying - https://phabricator.wikimedia.org/T116184#1742588 (10akosiaris) 3NEW [15:42:46] 6operations, 10ops-eqiad: mw1083's sda disk is dying - https://phabricator.wikimedia.org/T116184#1742596 (10akosiaris) p:5Triage>3Normal [15:45:37] (03PS1) 10Muehlenhoff: Assign salt grain for mw_rc_irc [puppet] - 10https://gerrit.wikimedia.org/r/247854 [15:46:25] ebernhardson: is swat done? 
[15:46:43] (03CR) 10Muehlenhoff: [C: 032 V: 032] Assign salt grain for mw_rc_irc [puppet] - 10https://gerrit.wikimedia.org/r/247854 (owner: 10Muehlenhoff) [15:46:52] aude: not yet, undeploying some code that was marked as a potential perf problem [15:46:56] ok [15:46:58] (03PS1) 10Alexandros Kosiaris: gsb: Amend check command [puppet] - 10https://gerrit.wikimedia.org/r/247856 [15:47:06] (sorry, the patches wern't pre-prepared. it was requested just as swat started) [15:47:16] no hurry [15:49:11] !log ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/WikimediaEvents/: Revert SearchSatisfaction schema related changes due to suspected perf impact (duration: 00m 18s) [15:49:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:50:27] !log ebernhardson@tin Synchronized php-1.27.0-wmf.2/extensions/WikimediaEvents/: Revert SearchSatisfaction schema related changes due to suspected perf impact (duration: 00m 18s) [15:50:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:50:33] ori: ^^ lemme know (later) if it fixes the problem [15:50:45] aude: should be good now [15:51:56] ebernhardson: thanks dude [15:57:30] (03CR) 10Mobrovac: [C: 031] service::node support overriding the repository [puppet] - 10https://gerrit.wikimedia.org/r/247846 (owner: 10Alexandros Kosiaris) [15:58:30] (03PS2) 10Alexandros Kosiaris: gsb: Amend check command [puppet] - 10https://gerrit.wikimedia.org/r/247856 [15:58:38] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] gsb: Amend check command [puppet] - 10https://gerrit.wikimedia.org/r/247856 (owner: 10Alexandros Kosiaris) [15:59:53] (03PS1) 10Alexandros Kosiaris: mw1083: depooled, remove from dsh [puppet] - 10https://gerrit.wikimedia.org/r/247859 (https://phabricator.wikimedia.org/T116184) [16:00:11] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] mw1083: depooled, remove from dsh [puppet] - 10https://gerrit.wikimedia.org/r/247859 
(https://phabricator.wikimedia.org/T116184) (owner: 10Alexandros Kosiaris) [16:00:17] (03PS2) 10Alexandros Kosiaris: mw1083: depooled, remove from dsh [puppet] - 10https://gerrit.wikimedia.org/r/247859 (https://phabricator.wikimedia.org/T116184) [16:00:22] (03CR) 10Alexandros Kosiaris: [V: 032] mw1083: depooled, remove from dsh [puppet] - 10https://gerrit.wikimedia.org/r/247859 (https://phabricator.wikimedia.org/T116184) (owner: 10Alexandros Kosiaris) [16:06:02] thcipriani: if I need to add anything in PrivateSettings.php, do I need to schedule for SWAT? [16:06:11] (or just deploy myself)? [16:14:02] (03CR) 10Alexandros Kosiaris: cxserver: Add JWT token support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/247819 (https://phabricator.wikimedia.org/T116134) (owner: 10KartikMistry) [16:16:58] (03PS1) 10Andrew Bogott: Allow wikitech-static to drift a bit more than two days away from wikitech. [puppet] - 10https://gerrit.wikimedia.org/r/247863 (https://phabricator.wikimedia.org/T101803) [16:18:53] (03PS2) 10Alexandros Kosiaris: service::node support overriding the repository [puppet] - 10https://gerrit.wikimedia.org/r/247846 [16:19:03] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] service::node support overriding the repository [puppet] - 10https://gerrit.wikimedia.org/r/247846 (owner: 10Alexandros Kosiaris) [16:19:07] (03PS2) 10Dzahn: Allow wikitech-static to drift a bit more than two days away from wikitech. [puppet] - 10https://gerrit.wikimedia.org/r/247863 (https://phabricator.wikimedia.org/T101803) (owner: 10Andrew Bogott) [16:19:14] (03CR) 10Dzahn: [C: 032] Allow wikitech-static to drift a bit more than two days away from wikitech. [puppet] - 10https://gerrit.wikimedia.org/r/247863 (https://phabricator.wikimedia.org/T101803) (owner: 10Andrew Bogott) [16:19:51] (03PS3) 10Dzahn: Allow wikitech-static to drift a bit more than two days away from wikitech. 
[puppet] - 10https://gerrit.wikimedia.org/r/247863 (https://phabricator.wikimedia.org/T101803) (owner: 10Andrew Bogott) [16:21:36] (03PS9) 10Alexandros Kosiaris: maps: Add tileratorui service [puppet] - 10https://gerrit.wikimedia.org/r/244436 (https://phabricator.wikimedia.org/T116062) [16:22:06] (03PS1) 10Addshore: Retain wikidata.daily.* graphite metrics for longer [puppet] - 10https://gerrit.wikimedia.org/r/247866 [16:24:15] (03PS2) 10Dzahn: admin: create agomez and add to stats groups [puppet] - 10https://gerrit.wikimedia.org/r/247467 (https://phabricator.wikimedia.org/T115666) [16:24:35] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1742697 (10Dzahn) thanks @ottomata. amended the change https://gerrit.wikimedia.org/r/#/c/247467/ [16:26:24] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1742705 (10Dzahn) @atgo did you get to create a key yet? [16:26:37] mutante: since everything is green now, shall I close that wikitech-static phab task? [16:26:50] 10Ops-Access-Requests, 6operations: Grant Joseph and Dan deploy permissions on aqs100[1-3] - https://phabricator.wikimedia.org/T116169#1742706 (10Dzahn) a:3Dzahn [16:27:00] 10Ops-Access-Requests, 6operations: Grant Joseph and Dan deploy permissions on aqs100[1-3] - https://phabricator.wikimedia.org/T116169#1742708 (10Dzahn) p:5Triage>3Normal [16:28:25] andrewbogott: yes, close the ticket and leave the comment in icinga (which it does unless somebody explicitely deletes it), that way if it ever comes back we just follow the link again and open it [16:28:38] (03PS5) 10KartikMistry: cxserver: Add JWT token support [puppet] - 10https://gerrit.wikimedia.org/r/247819 (https://phabricator.wikimedia.org/T116134) [16:28:39] ok! 
[16:28:45] that's what got me there this time [16:29:39] 6operations, 6Labs, 10Labs-Infrastructure, 10Wikimedia-Apache-configuration, and 2 others: wikitech-static sync broken - https://phabricator.wikimedia.org/T101803#1742711 (10Andrew) 5Open>3Resolved All green, and I've made the test less touchy. So, closing this bug for now. [16:31:38] 10Ops-Access-Requests, 6operations: Grant Joseph and Dan deploy permissions on aqs100[1-3] - https://phabricator.wikimedia.org/T116169#1742713 (10Dzahn) This sounds like we need a new admin group that does not exist yet. [16:32:54] 10Ops-Access-Requests, 6operations: Grant Joseph and Dan deploy permissions on aqs100[1-3] - https://phabricator.wikimedia.org/T116169#1742715 (10Dzahn) @milimetric or @ottomata could you list the commands needed for deployment? [16:34:33] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [16:36:03] PROBLEM - YARN NodeManager Node-State on analytics1032 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[16:37:43] RECOVERY - YARN NodeManager Node-State on analytics1032 is OK: OK: YARN NodeManager analytics1032.eqiad.wmnet:8041 Node-State: RUNNING [16:39:45] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [16:42:29] (03PS1) 10Dzahn: archiva: add ssl cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247868 [16:43:59] (03PS2) 10Dzahn: archiva: add ssl cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247868 [16:44:41] (03PS3) 10Dzahn: archiva: add ssl cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247868 [16:44:59] (03CR) 10Dzahn: [C: 032] archiva: add ssl cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247868 (owner: 10Dzahn) [16:45:05] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 5 below the confidence bounds [16:50:33] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 11 data above and 5 below the confidence bounds [16:52:34] 6operations, 10netops, 10procurement: audit juniper hardware locations for support coverage - https://phabricator.wikimedia.org/T116051#1742772 (10RobH) RT of the renewal mentioned: https://rt.wikimedia.org/Ticket/Display.html?id=9610 [16:53:17] !log Attached WeiaR@enwiki to the global account of the same name. 
T115699 [16:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:54:34] PROBLEM - mediawiki-installation DSH group on mw1083 is CRITICAL: Host mw1083 is not in mediawiki-installation dsh group [16:54:53] thanks hoo :D [16:55:57] (03CR) 10Mobrovac: Add a public endpoint for AQS [puppet] - 10https://gerrit.wikimedia.org/r/245887 (https://phabricator.wikimedia.org/T114830) (owner: 10Milimetric) [17:01:49] 6operations, 10Traffic, 7Performance: missing SPDY coalesce for upload.wm.o for images ref'd in projects' page outputs - https://phabricator.wikimedia.org/T116132#1742831 (10Krinkle) [17:04:54] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 14 data above and 5 below the confidence bounds [17:10:14] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [17:14:02] (03PS1) 10Luke081515: Add throttle exception for eswiki at 2015-10-23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247872 (https://phabricator.wikimedia.org/T116183) [17:14:43] Can someone review that fast? The request it late, so we need to deploy it till friday [17:16:09] Luke081515: Sign it up for SWAT [17:16:31] ok [17:17:59] done [17:23:22] (03PS2) 10Krinkle: beta: Remove commented out rules for www2.knams.wikimedia.org/stats [puppet] - 10https://gerrit.wikimedia.org/r/240919 [17:23:56] (03PS1) 10Dzahn: rc stream: add ssl cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247874 [17:25:27] (03CR) 10Dzahn: "maybe remove it at once from non-beta as well then, so we have less of a diff between prod and beta" [puppet] - 10https://gerrit.wikimedia.org/r/240919 (owner: 10Krinkle) [17:26:21] (03CR) 10Krinkle: "@Dzhan: They only exist in beta configs. They're already gone in prod." 
[puppet] - 10https://gerrit.wikimedia.org/r/240919 (owner: 10Krinkle) [17:26:56] (03PS2) 10Dzahn: rcstream: add ssl cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247874 [17:27:40] (03CR) 10Dzahn: [C: 032] beta: Remove commented out rules for www2.knams.wikimedia.org/stats [puppet] - 10https://gerrit.wikimedia.org/r/240919 (owner: 10Krinkle) [17:28:35] (03PS3) 10Dzahn: rcstream: add ssl cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247874 [17:28:59] (03CR) 10Dzahn: [C: 032] rcstream: add ssl cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247874 (owner: 10Dzahn) [17:51:27] (03PS1) 10Muehlenhoff: Assign salt grains for deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/247877 [17:51:29] (03PS1) 10Muehlenhoff: Assign salt grains for logging hosts [puppet] - 10https://gerrit.wikimedia.org/r/247878 [17:51:31] (03PS1) 10Muehlenhoff: Assign salt grains for logstash [puppet] - 10https://gerrit.wikimedia.org/r/247879 [17:51:33] (03PS1) 10Muehlenhoff: Assign salt grains for package::builder [puppet] - 10https://gerrit.wikimedia.org/r/247880 [17:51:35] (03PS1) 10Muehlenhoff: Assign salt grains for racktables [puppet] - 10https://gerrit.wikimedia.org/r/247881 [17:51:45] PROBLEM - Analytics Cassanda CQL query interface on aqs1003 is CRITICAL: Connection timed out [17:53:44] (03PS2) 10Nuria: Enabling mapjoins in hive by default [puppet/cdh] - 10https://gerrit.wikimedia.org/r/247758 (https://phabricator.wikimedia.org/T116202) [17:57:49] RECOVERY - Analytics Cassanda CQL query interface on aqs1003 is OK: TCP OK - 0.000 second response time on port 9042 [18:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151021T1800). 
[18:00:14] Robh: now the problem is on our cage when leaving for lunch the cage door was left open and when back stay open [18:00:44] (03CR) 10Muehlenhoff: [C: 031] interface: do not 'ensure latest',do require_package [puppet] - 10https://gerrit.wikimedia.org/r/247008 (https://phabricator.wikimedia.org/T115348) (owner: 10Dzahn) [18:02:53] (03PS1) 10Alexandros Kosiaris: gsb: Another try at fixing it [puppet] - 10https://gerrit.wikimedia.org/r/247888 [18:10:01] (03CR) 10Alexandros Kosiaris: [C: 032] gsb: Another try at fixing it [puppet] - 10https://gerrit.wikimedia.org/r/247888 (owner: 10Alexandros Kosiaris) [18:12:41] heya moritzm, still around? [18:12:44] .deb packaging question [18:12:52] madhuvishy and I want to use https://github.com/linkedin/Burrow [18:12:55] it is go! [18:13:04] i have never made a go debian package [18:13:06] any tips? [18:13:12] is there a go debhelper?! :) [18:13:25] ottomata: i found this http://pkg-go.alioth.debian.org/packaging.html [18:13:46] HMM, aye, [18:13:47] cooool [18:13:50] <_joe_> ottomata: ahahahaha [18:13:50] yes there is, dh-make-golang iirc [18:13:53] it says there's dh-golang and dh-make-golang [18:13:55] I've never made a go package either [18:14:00] <_joe_> go packaging, prepare yourself for so much fun [18:14:04] <_joe_> I did [18:14:08] <_joe_> :) [18:14:08] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 5 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1743164 (10mobrovac) [18:14:22] _joe_: good fun? :) [18:14:26] <_joe_> madhuvishy: nope [18:14:29] _joe_: i know you can help!? 
wee [18:14:44] it can't be that bad [18:14:47] reading this too [18:14:47] http://stackoverflow.com/questions/15104089/packaging-golang-application-for-debian [18:15:01] <_joe_> paravoid: it's just bad because almost nothing is already packaged [18:15:02] ottomata: i think that's pretty old [18:15:19] <_joe_> and you have most of the times to figure out what is missing by reading the source [18:15:26] the first answer has a recent update though [18:15:50] _joe_: https://people.debian.org/~stapelberg/2015/07/27/dh-make-golang.html claims they have been packaging stuff [18:15:53] <_joe_> ottomata: actually as soon as I feel better I have to upgrade etcd [18:15:59] although i know nothing about packaging [18:16:27] <_joe_> madhuvishy: it might have changed over the last 4-6 months [18:16:32] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 5 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1743180 (10chasemp) [18:16:37] labsdb1004 is now reloading its dump- in theory, no alters should be given (all are ok), so I won't extend the downtime [18:16:37] etcd is already packaged in debian isn't it? [18:16:40] this does seem helpful [18:16:40] https://people.debian.org/~stapelberg/2015/07/27/dh-make-golang.html [18:16:41] and recent [18:16:43] going to try [18:16:45] _joe_: it looks like it, all this stuff seems to be super recent - last 2 months [18:16:51] <_joe_> paravoid: it is, in sid I think [18:17:01] well yeah, sid/stretch [18:17:06] and we could backport to jessie-backports too [18:17:10] but just in case, the SAL I did this morning still applies- do not panic for anything on that host [18:17:17] 6operations, 6Release-Engineering-Team, 7Database, 5Patch-For-Review: Recover missing values from user_properties tables - https://phabricator.wikimedia.org/T114899#1743184 (10demon) 5Open>3Resolved a:3demon >>! 
In T114899#1709471, @jcrespo wrote: > We can recover preferences on an individual bases,... [18:17:20] ottomata: I tried apt-get install dh-make-golang and it never found the package - dont know if some dumbness on my end (i did do apt-get update) [18:17:33] <_joe_> madhuvishy: it's probably only in sid [18:17:34] s/alters/alerts [18:17:39] _joe_: aah [18:17:45] yeah [18:17:53] looks like it [18:17:56] <_joe_> hence paravoid's suggestion to backport it [18:18:06] mm hmmm [18:18:20] <_joe_> and yeah dh-make-golang might help a _lot_ [18:18:52] madhuvishy: which target platform do you need, only jessie or also trusty/precise? [18:19:11] (03PS3) 10Nuria: Enabling mapjoins in hive by default [puppet/cdh] - 10https://gerrit.wikimedia.org/r/247758 (https://phabricator.wikimedia.org/T116202) [18:19:30] <_joe_> ok moritzm might help as well :) [18:19:40] moritzm: I think jessie will do. [18:19:46] ottomata: ^ [18:20:14] <_joe_> it feels good working in a team where a sizeable amount of people know debian packaging better than me :) [18:20:38] 6operations: test task - https://phabricator.wikimedia.org/T116210#1743202 (10chasemp) 3NEW [18:21:40] moritzm: yeah, for dh-make-golang, not sure. [18:21:54] i guess whatever, because that's just for generating the debian/ stuff [18:21:59] but, for burrow, the package we want to build [18:22:02] jessie is likely fine [18:22:06] i'm not sure where we'll run that yet [18:22:06] <_joe_> ottomata: you actually need dh-make-golang in a VM if you want [18:22:32] ja just cause i only need it to generate the inital packaging, right? 
[18:23:43] I briefly tried to build dh-make-golang on my jessie box and there's a bit of a tail of further packages also not yet in jessie [18:23:54] ottomata: you only need it once to generate the source package [18:24:06] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 5 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1743214 (10chasemp) [18:24:07] so a debian unstable chroot would do [18:24:43] but we'll likely need this further and building this for jessie isn't much work [18:24:46] aye [18:24:47] indeed [18:25:18] what's needed at build time is dh-golang and that is part of jessie [18:26:35] i can make/upload a jessie backport of dh-make-golang tomorrow [18:28:43] ok awesome, thanks moritzm [18:29:09] 6operations, 6Phabricator, 6Project-Creators: create acl*operationsteam & acl*procurement projects, cease using #operations for access control - https://phabricator.wikimedia.org/T114135#1743217 (10RobH) Post migration sync up (@chasemp and I have been cleanging things up). Included in migration was also t... 
[18:30:52] 6operations, 6Editing-Department, 6Parsing-Team, 6Services: Services team goals October - December 2015 (Q2 2015/16) - https://phabricator.wikimedia.org/T111819#1743218 (10GWicke) [18:36:49] 6operations, 10ContentTranslation-Deployments, 10ContentTranslation-cxserver, 10Parsoid, and 4 others: Decom parsoid-lb.eqiad.wikimedia.org entrypoint - https://phabricator.wikimedia.org/T110474#1743245 (10Krenair) [18:37:57] PROBLEM - ElasticSearch health check for shards on nobelium is CRITICAL: CRITICAL - elasticsearch inactive shards 349 threshold =0.1% breach: status: yellow, number_of_nodes: 1, unassigned_shards: 349, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 353, cluster_name: labsearch, relocating_shards: 0, active_shards: 353, initializing_shards: 0, number_of_data_nodes: 1, delayed_unassign [18:38:57] 6operations, 10MediaWiki-extensions-BounceHandler, 5Patch-For-Review: Need an administrative front end for BounceHandler - https://phabricator.wikimedia.org/T114020#1743249 (1001tonythomas) Hope to see the graphs in thttps://grafana.wikimedia.org soon ``` 00:00 tonythomas: I'll create a dashboard... [18:39:39] ebernhardson: ^ ? [18:40:24] chasemp: i'm initializing indices in codfw and labsearch right now [18:40:36] chasemp: so any warning is probably just some half-initialized index at the moment of the check [18:40:45] (in labsearch and codfw. eqiad errors are real) [18:42:11] !log initializing elasticsearch index mapping for all wikis in the codfw and labsearch ES clusters [18:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:42:19] might as well :) [18:42:58] will be starting the index copies a little later today [18:44:40] k [18:45:57] PROBLEM - dhclient process on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[18:45:58] PROBLEM - Check size of conntrack table on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:46:16] PROBLEM - DPKG on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:46:17] PROBLEM - puppet last run on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:46:27] PROBLEM - Hadoop NodeManager on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:46:30] PROBLEM - Disk space on Hadoop worker on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:46:36] PROBLEM - salt-minion processes on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:46:57] PROBLEM - RAID on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:47:06] PROBLEM - SSH on analytics1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:47:07] PROBLEM - Hadoop DataNode on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:47:07] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:47:17] PROBLEM - configured eth on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:47:34] * ebernhardson wonders if those could be bundled together into 'many failures on analytics1039' [18:47:47] PROBLEM - Disk space on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:47:57] RECOVERY - DPKG on analytics1039 is OK: All packages OK [18:47:59] yes, they could. using service dependencies. 
icinga can do it, but we need puppet abstractions [18:48:06] RECOVERY - puppet last run on analytics1039 is OK: OK: Puppet is currently enabled, last run 16 minutes ago with 0 failures [18:48:06] RECOVERY - Hadoop NodeManager on analytics1039 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [18:48:08] RECOVERY - Disk space on Hadoop worker on analytics1039 is OK: DISK OK [18:48:20] RECOVERY - salt-minion processes on analytics1039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [18:48:38] RECOVERY - RAID on analytics1039 is OK: OK: optimal, 13 logical, 14 physical [18:48:47] RECOVERY - SSH on analytics1039 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [18:48:48] RECOVERY - Hadoop DataNode on analytics1039 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [18:49:06] RECOVERY - configured eth on analytics1039 is OK: OK - interfaces up [18:49:27] RECOVERY - Disk space on analytics1039 is OK: DISK OK [18:49:28] RECOVERY - dhclient process on analytics1039 is OK: PROCS OK: 0 processes with command name dhclient [18:49:36] RECOVERY - Check size of conntrack table on analytics1039 is OK: OK: nf_conntrack is 0 % full [18:52:37] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING [18:56:45] this socket timeout thing.... [18:57:11] operations, MediaWiki-extensions-BounceHandler, Patch-For-Review: Need an administrative front end for BounceHandler - https://phabricator.wikimedia.org/T114020#1743304 (Legoktm) a:Legoktm [19:03:05] matanya: https://phabricator.wikimedia.org/project/profile/1025/ [19:03:13] why havent you joined yet!?! ;] [19:03:33] other than it just happened and all... [19:03:47] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
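As a rough illustration of the bundling idea raised at 18:47:34, here is a minimal Python sketch that collapses a burst of per-service alerts into one line per host. It is not how Icinga does it (the reply above points at service dependencies for that); the function name `bundle_alerts`, the regex, and the threshold of 3 are all assumptions made up for this example.

```python
import re
from collections import defaultdict

# Matches bot lines such as:
#   "PROBLEM - DPKG on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds."
# The greedy service group means the host is taken from the last " on <host> is CRITICAL".
ALERT_RE = re.compile(r"^PROBLEM - (?P<service>.+) on (?P<host>\S+) is CRITICAL")


def bundle_alerts(lines, threshold=3):
    """Collapse CRITICAL alerts into one summary line per host.

    Hypothetical helper: hosts with at least `threshold` failing checks
    get a single "many failures" line; other hosts keep one line per check.
    Lines that do not look like PROBLEM/CRITICAL alerts are ignored.
    """
    by_host = defaultdict(list)
    for line in lines:
        match = ALERT_RE.match(line)
        if match:
            by_host[match.group("host")].append(match.group("service"))

    bundled = []
    for host, services in by_host.items():
        if len(services) >= threshold:
            bundled.append(
                f"PROBLEM - many failures on {host} ({len(services)} checks CRITICAL)"
            )
        else:
            bundled.extend(
                f"PROBLEM - {service} on {host} is CRITICAL" for service in services
            )
    return bundled
```

Grouping after the fact like this only tidies the channel output; a dependency-based setup would suppress the child notifications at the source, which is presumably why that route was suggested.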
[19:04:37] robh: well joined :p [19:05:36] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING [19:08:16] operations, Phabricator, Project-Creators: create acl*operationsteam & acl*procurement projects, cease using #operations for access control - https://phabricator.wikimedia.org/T114135#1743376 (RobH) So @chasemp created the associated project (new #operations) as the old one was renamed to #acl*operation... [19:08:23] operations, Phabricator, Project-Creators: Create policy projects and convert people projects to open - https://phabricator.wikimedia.org/T90491#1743379 (RobH) [19:08:25] operations, Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1743378 (RobH) [19:09:06] operations, Phabricator: unable to subscribe to operations tag after migration and merge from ops-core and ops-request - https://phabricator.wikimedia.org/T89053#1743382 (RobH) Open>Resolved a:RobH So with the details listed in T114135, #operations is now joinable by anyone who wants to join. #ac... [19:10:12] Puppet, operations, Continuous-Integration-Infrastructure: Puppet (silently) fails to setup apache on new trusty instances - https://phabricator.wikimedia.org/T91832#1743389 (hashar) Open>Resolved a:hashar I haven't hit that issue when building a new Trusty slave. [19:10:24] operations, Phabricator: unable to subscribe to operations tag after migration and merge from ops-core and ops-request - https://phabricator.wikimedia.org/T89053#1743395 (RobH) I noticed after my update there are long standing discussions about policy involving project creation, however it seems outside of...
[19:10:40] operations, Phabricator, Project-Creators: create acl*operationsteam & acl*procurement projects, cease using #operations for access control - https://phabricator.wikimedia.org/T114135#1743399 (chasemp) Outcome: * This mucked up the existing work board for #operations but since we don’t use it we decide... [19:11:51] operations, Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1743410 (RobH) As all the ACL groups have been completed, the next step is to simply start testing regular use. (We've done our initial task creation and attachment tests, as well as emailing in t... [19:15:52] hadoop is BUSsaayyy [19:17:03] (PS1) Dzahn: deactivate vik[i]pedia.com.tr Turkish domains [dns] - https://gerrit.wikimedia.org/r/247903 (https://phabricator.wikimedia.org/T83077) [19:18:57] (PS2) Dzahn: deactivate vik[i]pedia.com.tr Turkish domains [dns] - https://gerrit.wikimedia.org/r/247903 (https://phabricator.wikimedia.org/T83077) [19:19:17] (PS3) Dzahn: deactivate vikipedi[a].com.tr Turkish domains [dns] - https://gerrit.wikimedia.org/r/247903 (https://phabricator.wikimedia.org/T83077) [19:21:36] (CR) John F. Lewis: [C: 1] deactivate vikipedi[a].com.tr Turkish domains [dns] - https://gerrit.wikimedia.org/r/247903 (https://phabricator.wikimedia.org/T83077) (owner: Dzahn) [19:21:51] (CR) John F. Lewis: [C: 1] deactivate vikipedia.com.tr [dns] - https://gerrit.wikimedia.org/r/244082 (owner: Dzahn) [19:22:12] (CR) John F. Lewis: [C: 1] deactivate wikimania.asia [dns] - https://gerrit.wikimedia.org/r/244103 (owner: Dzahn) [19:22:45] (CR) John F. Lewis: [C: 1] deactivate wikiknihy.cz [dns] - https://gerrit.wikimedia.org/r/244104 (owner: Dzahn) [19:23:05] (CR) John F.
Lewis: [C: 1] deactivate wikipaedia.net [dns] - https://gerrit.wikimedia.org/r/244090 (owner: Dzahn) [19:23:07] (CR) Dzahn: "eh, it's actually 2 domains, sorry, duplicate for both here:" [dns] - https://gerrit.wikimedia.org/r/244082 (owner: Dzahn) [19:23:22] (CR) John F. Lewis: [C: 1] deactivate wiki[p|m]ediastories.[com|net|org] [dns] - https://gerrit.wikimedia.org/r/244086 (owner: Dzahn) [19:23:28] (Abandoned) Dzahn: deactivate vikipedia.com.tr [dns] - https://gerrit.wikimedia.org/r/244082 (owner: Dzahn) [19:23:38] (CR) John F. Lewis: [C: 1] deactivate wikimediacommons.[co.uk|eu|info|jp.net|mobi|net|org] [dns] - https://gerrit.wikimedia.org/r/244092 (owner: Dzahn) [19:23:52] (CR) John F. Lewis: [C: 1] deactivate wikimemory.org [dns] - https://gerrit.wikimedia.org/r/244101 (owner: Dzahn) [19:24:06] (CR) John F. Lewis: [C: 1] deactivate wikimaps.[com|net|org] domains [dns] - https://gerrit.wikimedia.org/r/244078 (owner: Dzahn) [19:24:19] (CR) John F. Lewis: [C: 1] deactivate wekipedia.com [dns] - https://gerrit.wikimedia.org/r/244085 (owner: Dzahn) [19:24:29] (CR) John F. Lewis: [C: 1] deactivate wikimedia.biz [dns] - https://gerrit.wikimedia.org/r/244084 (owner: Dzahn) [19:24:39] (CR) John F. Lewis: [C: 1] deactivate wikidisclosure.[com|org] [dns] - https://gerrit.wikimedia.org/r/243973 (owner: Dzahn) [19:24:47] (CR) John F. Lewis: [C: 1] deactivate wikifamily.[com|org] [dns] - https://gerrit.wikimedia.org/r/243972 (owner: Dzahn) [19:25:00] [sorry for the +1 spam!] [19:26:11] :) thanks, these can be treated almost like voting on wikis [19:26:18] it's more that than technical [19:26:47] so the more +1 the better [19:27:07] mutante: you need a independent person to +2 then! closes must be done by a unrelated party :P [19:27:29] and -1's with reasons need weight ;) [19:28:30] thanks robh! {{done}} [19:28:33] heh, fair. i see a scale between..
say "wikiartpedia.biz" and "wikiquotes.org" [19:28:49] so some i have done but some i will definitely wait [19:29:14] i am natural here [19:30:18] :) [19:31:06] (03PS4) 10Dzahn: deactivate vikipedi[a].com.tr Turkish domains [dns] - 10https://gerrit.wikimedia.org/r/247903 (https://phabricator.wikimedia.org/T83077) [19:40:08] PROBLEM - YARN NodeManager Node-State on analytics1038 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [19:41:56] RECOVERY - YARN NodeManager Node-State on analytics1038 is OK: OK: YARN NodeManager analytics1038.eqiad.wmnet:8041 Node-State: RUNNING [19:42:06] PROBLEM - Analytics Cassanda CQL query interface on aqs1003 is CRITICAL: Connection timed out [19:46:01] (03PS1) 10Dzahn: icinga: ssl cert monitoring for external services [puppet] - 10https://gerrit.wikimedia.org/r/247905 (https://phabricator.wikimedia.org/T114059) [19:46:20] (03PS6) 10Chad: scap: Add co-master configuration [puppet] - 10https://gerrit.wikimedia.org/r/224829 (https://phabricator.wikimedia.org/T104826) (owner: 10BryanDavis) [19:46:22] (03PS3) 10Chad: Move dsh code into scap where it belongs [puppet] - 10https://gerrit.wikimedia.org/r/247304 [19:47:15] (03PS2) 10Dzahn: icinga: ssl cert monitoring for external services [puppet] - 10https://gerrit.wikimedia.org/r/247905 (https://phabricator.wikimedia.org/T114059) [19:47:27] RECOVERY - Analytics Cassanda CQL query interface on aqs1003 is OK: TCP OK - 0.020 second response time on port 9042 [19:48:00] 6operations, 6Release-Engineering-Team, 7Database, 5Patch-For-Review, 7WorkType-Maintenance: Recover missing values from user_properties tables - https://phabricator.wikimedia.org/T114899#1743574 (10hashar) [19:52:28] RECOVERY - ElasticSearch health check for shards on nobelium is OK: OK - elasticsearch status labsearch: status: green, number_of_nodes: 1, unassigned_shards: 0, number_of_pending_tasks: 2, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 1787, cluster_name: labsearch, 
relocating_shards: 0, active_shards: 1787, initializing_shards: 0, number_of_data_nodes: 1, delayed_unassigned_shards: 0 [19:53:06] PROBLEM - Analytics Cassanda CQL query interface on aqs1003 is CRITICAL: Connection timed out [19:54:07] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [19:54:31] (03PS1) 10Dzahn: eventdonations: add SSL cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247907 [19:54:33] (03PS1) 10Dzahn: toolserver.org - add SSL cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247908 [19:54:35] (03PS1) 10Dzahn: planet: add ssl cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247909 [19:56:10] (03PS2) 10Dzahn: eventdonations: add SSL cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247907 [19:56:48] (03PS3) 10Dzahn: eventdonations: add SSL cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247907 [19:59:07] (03PS2) 10Dzahn: toolserver.org - add SSL cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247908 [19:59:37] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds [19:59:38] (03PS1) 10Ottomata: Disable AQS cassandra CQL interface check until AQS is production ready [puppet] - 10https://gerrit.wikimedia.org/r/247910 [20:00:04] gwicke cscott arlolra subbu bearND mdholloway: Respected human, time to deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151021T2000). Please do the needful. 
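The "add SSL cert expiry check" patches above boil down to comparing a certificate's notAfter date against warning and critical thresholds. A minimal Python sketch of that logic, assuming made-up thresholds (30 and 14 days) and hypothetical helper names, and not the actual Icinga plugin or puppet code used in those changes:

```python
import socket
import ssl


def days_until_expiry(not_after, now_epoch):
    """Whole days between now_epoch and an OpenSSL-style notAfter string.

    not_after uses the format returned by getpeercert(),
    e.g. "Aug  4 12:10:02 2016 GMT".
    """
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return int((expires_epoch - now_epoch) // 86400)


def classify(days_left, warn_days=30, crit_days=14):
    """Map remaining days to a Nagios-style state (thresholds are assumptions)."""
    if days_left <= crit_days:
        return "CRITICAL"
    if days_left <= warn_days:
        return "WARNING"
    return "OK"


def fetch_not_after(host, port=443, timeout=10):
    """Fetch the notAfter field of the certificate served by host:port.

    Needs network access; certificate validation uses the default context.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Because this talks to the HTTPS port directly, it keeps working for externally hosted sites that drop ICMP, where a ping-based host check would report the host as down.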
[20:00:13] no parsoid deploy today [20:00:33] the RB deploy happened earlier this morning [20:01:02] (03PS2) 10Dzahn: planet: add ssl cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247909 [20:01:45] (03PS3) 10Dzahn: planet: add ssl cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247909 [20:03:57] RECOVERY - Analytics Cassanda CQL query interface on aqs1003 is OK: TCP OK - 0.998 second response time on port 9042 [20:04:36] (03CR) 10Madhuvishy: [C: 031] "I'm in favor!" [puppet] - 10https://gerrit.wikimedia.org/r/247910 (owner: 10Ottomata) [20:05:33] godog: [20:05:34] :( [20:05:36] Oct 20 03:40:14 cp4014 kernel: [13259932.218753] diamond[10119]: segfault at 7fc4953d4620 ip 00007fc4955de52d sp 00007fffc99561d0 error 4 in libvarnishapi.so.1.0.0[7fc4955d5000+12000] [20:05:52] https://graphite.wikimedia.org/render/?width=588&height=311&target=servers.cp4014.varnish.ulsfo.upload.frontend.request.client.method.get&target=servers.cp4014.varnish.ulsfo.upload.frontend.request.client.total&from=-72hours [20:06:04] (03CR) 10Yurik: "Can we redirect it to maps.wikimedia.org ?" 
[dns] - 10https://gerrit.wikimedia.org/r/244078 (owner: 10Dzahn) [20:06:06] i might resort to doing what ori is doing for some of these metrics [20:06:10] and send to statsd directly [20:06:12] rather than using diamond [20:06:23] (03PS3) 10Dzahn: icinga: ssl cert monitoring for external services [puppet] - 10https://gerrit.wikimedia.org/r/247905 (https://phabricator.wikimedia.org/T114059) [20:06:39] (03CR) 10Dzahn: [C: 032] "like gsbmonitoring does it too" [puppet] - 10https://gerrit.wikimedia.org/r/247905 (https://phabricator.wikimedia.org/T114059) (owner: 10Dzahn) [20:07:08] (03CR) 10Dzahn: "only if we think the redirect is worth getting a cert" [dns] - 10https://gerrit.wikimedia.org/r/244078 (owner: 10Dzahn) [20:08:51] 6operations, 6Phabricator, 7audits-data-retention: Enable mod_remoteip on Phabricator and ensure logs follow retention guidelines - https://phabricator.wikimedia.org/T114014#1743641 (10chasemp) https://secure.phabricator.com/D14315 [20:10:09] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/247905 (https://phabricator.wikimedia.org/T114059) (owner: 10Dzahn) [20:12:10] (03PS1) 10Ori.livneh: update dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/247913 [20:12:24] (03CR) 10Ori.livneh: [C: 032 V: 032] update dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/247913 (owner: 10Ori.livneh) [20:12:43] (03PS4) 10Dzahn: icinga: ssl cert monitoring for external services [puppet] - 10https://gerrit.wikimedia.org/r/247905 (https://phabricator.wikimedia.org/T114059) [20:14:18] !log starting copy of elasticsearch indices from eqiad cluster to labsearch cluster [20:14:21] YuviPanda: ^^ [20:14:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:16:22] !log Started running 8 threads of commonswiki refreshlinks jobs on terbium [20:16:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:18:20] 6operations, 5Patch-For-Review: ssl expiry tracking in icinga - we 
don't monitor that many domains - https://phabricator.wikimedia.org/T114059#1743688 (10Dzahn) archiva: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=titanium&service=HTTPS ganglia: https://icinga.wikimedia.org/cgi-bin/icin... [20:19:11] 6operations, 6Analytics-Kanban, 7Monitoring, 5Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1743691 (10Ottomata) Rats, I see some segfaults: Oct 20 03:40:14 cp4014 kernel: [13259932.218753] diamond[10119]: segfault at 7fc4953d4620 ip 00007fc4955de52d sp 00007fffc99561d0... [20:26:15] 6operations, 10Beta-Cluster-Infrastructure, 7WorkType-NewFunctionality: etcd/confd is not started on deployment-cache-mobile04 - https://phabricator.wikimedia.org/T116224#1743724 (10hashar) 3NEW [20:43:24] (03CR) 10Yurik: "how much money are we talking about?" [dns] - 10https://gerrit.wikimedia.org/r/244078 (owner: 10Dzahn) [20:46:00] !log cancel copying elasticsearch eqiad to labsearch, looks to be writing to wrong disks and will fill up [20:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:48:20] (03CR) 10Dzahn: "maybe Robh can chime in on that question. i don't really know. maybe some day we can use letsencrypt." [dns] - 10https://gerrit.wikimedia.org/r/244078 (owner: 10Dzahn) [20:48:53] mutante: i totally cannot put our pricing in a gerrit patchet. [20:49:04] (03CR) 10Alexandros Kosiaris: "I doubt it is worth it. We are already in the process of creating a new service, I see no point in associating it with an older, currently" [dns] - 10https://gerrit.wikimedia.org/r/244078 (owner: 10Dzahn) [20:49:28] robh: ah, ok, *nod* [20:49:41] though there is a longstanding issue in phab for that now [20:49:51] the entire redirection cluster for all the other domains [20:49:57] i'd just point them at that for any of those quesitons [20:50:17] i think i know which one you mean, "domain list" etc. 
yea [20:50:43] yea looking for it [20:50:44] sigh, I hate how google is unifying all of it's APIs [20:51:08] https://phabricator.wikimedia.org/T101048 [20:51:09] thank god they realized that oauth is not making sense for the safe browing lookup API at least [20:51:21] (03CR) 10Yurik: "ok, but i think we should keep the domain - Discovery team might get some interesting ideas about its usage in the future" [dns] - 10https://gerrit.wikimedia.org/r/244078 (owner: 10Dzahn) [20:52:12] (03CR) 10RobH: [C: 031] "There is a longstanding question on how to handle all the various domains and redirections. I think having the discussion for this partic" [dns] - 10https://gerrit.wikimedia.org/r/244078 (owner: 10Dzahn) [20:52:20] mutante: i chimed in so you arent the only one =] [20:52:55] basically i think deactivating it is better than leaving it sit without proper use [20:53:09] and that deciding on a cert for it is outside the scope of the patchset, since it is a policy level decision [20:53:33] (03CR) 10RobH: "I meant to link the phab task for said discussion on policy: https://phabricator.wikimedia.org/T101048" [dns] - 10https://gerrit.wikimedia.org/r/244078 (owner: 10Dzahn) [20:56:43] 10Ops-Access-Requests, 6operations, 5Patch-For-Review: Requesting access to stat1003, stat1002 and bast1001 for AGomez (WMF) - https://phabricator.wikimedia.org/T115666#1743863 (10atgo) Yes! Here's the contents of the file: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC+rFSx3I00yhYMJzgg3IthAqLBEnE9nab3DF9l+QKT2pW... [20:59:29] (03CR) 10MaxSem: "When I wrote "Discovery is fine with this" above this means that I ran this past Tomasz and he said kill it. 
We don't need to redirect thi" [dns] - 10https://gerrit.wikimedia.org/r/244078 (owner: 10Dzahn) [21:10:53] !log starting copy of elasticsearch eqiad indices to codfw [21:10:57] chasemp: ^ [21:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [21:11:16] (03PS1) 10Dzahn: icinga: ssl cert checks for external services,pt.2 [puppet] - 10https://gerrit.wikimedia.org/r/247919 [21:11:43] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK: OK: No anomaly detected [21:12:56] (03CR) 10Dzahn: [C: 032] icinga: ssl cert checks for external services,pt.2 [puppet] - 10https://gerrit.wikimedia.org/r/247919 (owner: 10Dzahn) [21:16:04] 6operations, 10Traffic, 6WMF-Legal: Policy decisions for new (and current) DNS domains registered to the WMF - https://phabricator.wikimedia.org/T101048#1743937 (10Slaporte) @Dzahn, @BBlack, this document is hereby declared master: https://docs.google.com/spreadsheets/d/1nmu60y1gkvc9NrvG0uPCS4wI9jfdCJIvwXVEv... [21:18:04] PROBLEM - nutcracker port on silver is CRITICAL: CRITICAL - Socket timeout after 2 seconds [21:21:53] RECOVERY - nutcracker port on silver is OK: TCP OK - 0.000 second response time on port 11212 [21:26:44] 6operations, 10Traffic, 6WMF-Legal: Policy decisions for new (and current) DNS domains registered to the WMF - https://phabricator.wikimedia.org/T101048#1743954 (10Dzahn) Ok, cool. I'll keep using that and already making some updates, like for wikipedia.lol :p [21:31:53] 6operations, 10Traffic, 6WMF-Legal: Policy decisions for new (and current) DNS domains registered to the WMF - https://phabricator.wikimedia.org/T101048#1743966 (10Dzahn) P.S. I found wilkipedia.org, wikidpedia.org and wikipedial.org have to be added and the master list is not complete yet. 
So more than 675 :) [21:43:51] (03PS4) 10Dzahn: eventdonations: add SSL cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247907 [21:44:38] (03CR) 10Dzahn: [C: 032] eventdonations: add SSL cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247907 (owner: 10Dzahn) [21:51:28] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 8.33% of data above the critical threshold [500.0] [21:52:36] (03PS3) 10Dzahn: toolserver.org - add SSL cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247908 [21:53:23] (03CR) 10Dzahn: [C: 032] toolserver.org - add SSL cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247908 (owner: 10Dzahn) [21:56:56] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:59:12] (03PS4) 10Dzahn: planet: add ssl cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247909 [22:00:11] (03CR) 10Dzahn: [C: 032] planet: add ssl cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247909 (owner: 10Dzahn) [22:01:01] a bit of a long shot...but is there something with more cpu than terbium i can use to do index copies for elasticsearch? There are 31 servers on one side, and 24 servers on the other side, the bottleneck is i have to run the php script from one machine (currently using 8 threads, out of 12 core on terbium) [22:02:44] ebernhardson: there's potentially spares you can use... But I don't know teh turnaround time [22:02:46] robh: ^^ [22:03:18] maybe this: [22:03:23] 2149 # Test server for labs ElasticSearch replication [22:03:23] 2150 node 'nobelium.eqiad.wmnet' { [22:03:41] thatdoesn't have mediawiki [22:03:53] its a separate cluster in labs that is also to be copied to (After i get the codfw copy worked out) [22:04:18] using the term "cluster" loosly...but if we could get mediawiki deployed there through puppet it would work [22:04:21] yea, so what Reedy said, you can request a temp. 
"spare" server [22:04:34] where did ebernhardson say temp? [22:04:39] we have a few in site.pp with "role spare" currently [22:04:57] applying mediawiki stuff to it wouldn't be an issue (ie puppet), then just sync-common [22:05:01] and im not the decider, merely the facilitator [22:05:16] robh: i didn,t but reedy suggested it might be possible. Basically i'm copying from an 800 core cluster to another 800 core cluster, but the bottleneck is i have to copy it all through one machine (using 8 threads on terbium currently) [22:05:28] ebernhardson: is it temp? [22:05:34] yes, its just to initialize the cluster [22:05:40] so its a dual cpu with 32gb [22:05:52] its as highly rated a misc server as we really have in terms of cpu and memory [22:05:56] 6operations, 10Analytics, 6Discovery, 10EventBus, and 7 others: EventBus MVP - https://phabricator.wikimedia.org/T114443#1744036 (10GWicke) We are having a hangout meeting tomorrow (Thursday, 22nd) between 11&12am SF time. Please let us know if you'd like to join. [22:06:10] ok, well it was worth asking at least. It can run for however long it takes [22:06:25] so yea, anything faster would mean buying something [22:06:35] or taking a system from another allocation [22:07:07] ebernhardson: if terbium is overloaded by having to do that AND its other jobs thats another reason to allocate an additional server though [22:07:21] but know it'll be one of similar spec. [22:07:25] protactinium is not being used but in puppet and an R610 [22:07:29] (if its a spare) [22:07:31] mutante: thats slower [22:07:32] i can cut down the # of procs, should i note that anywhere? 
this might run for a few days at current speeds [22:07:38] i'll email ops just so its known [22:07:42] robh: ugh, ok [22:07:46] log it too for ease [22:07:50] I guess on another server, he could use more threads etc [22:07:52] i logged it when starting [22:08:12] thats what i mean, if terbiums shared load is an issue then yea a request for hardware is reasonable [22:08:41] ebernhardson: https://wikitech.wikimedia.org/wiki/Operations_requests#Hardware_Requests details how to file the request. Its typically best if you do it, rather than I do it takign what you said in irc, etc... [22:08:58] then we can allocae you a system as needed/approved. i typically discuss allocations with mark daily. [22:09:24] just know the onsite spares i have arent faster than terbium, it would merely be a nonshared system. [22:09:46] ok [22:10:04] (so if you got in the request today, i'd review it with mark tomorrow for approvals.) [22:10:07] =] [22:11:05] reading now :) [22:11:19] PROBLEM - Host eventdonations.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [22:11:33] liessss [22:11:41] (its not down) [22:12:15] this is a third party hosted site right? [22:12:41] yep [22:12:48] ACKNOWLEDGEMENT - Host eventdonations.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn FR host doesnt want to ICMP? HTTPS works [22:12:51] seems to host outside us [22:12:57] yes, so https works there [22:13:06] it's just that it doesnt want to talk ICMP with us [22:13:16] so that makes icinga think the "host" is down [22:13:25] while the service https on it is up [22:13:54] that's what i'm doing right now, adding checks for the external services [22:14:13] "SSL OK - Certificate eventdonations.wikimedia.org valid until 2016-08-04 12:10:02 +0000 " [22:14:36] reason: fundraising firewall .. 
i expect [22:15:05] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=eventdonations.wikimedia.org&service=HTTPS-eventdonations [22:17:12] 6operations, 5Patch-For-Review: ssl expiry tracking in icinga - we don't monitor that many domains - https://phabricator.wikimedia.org/T114059#1744082 (10Dzahn) policy: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=policy.wikimedia.org blog: https://icinga.wikimedia.org/cgi-bin/icinga/ext... [22:19:51] https://etherpad.wikimedia.org/p/T114059 [22:24:48] 6operations, 6Discovery: Fix CirrusSearch monitoring - https://phabricator.wikimedia.org/T84163#1744098 (10chasemp) poked erik a bit who will chime in, sounds like it will sit for awhile and is not a fire [22:27:32] (03PS1) 10Ori.livneh: grafana: automate the creation of the Anonymous user [puppet] - 10https://gerrit.wikimedia.org/r/247925 [22:27:49] 6operations, 6Discovery: Fix CirrusSearch monitoring - https://phabricator.wikimedia.org/T84163#1744116 (10EBernhardson) I've gone back to review the size of the logs over the last month, it looks to have settled down from some of the worst cases back on aug 29. Figuring out what constitues a problem and what... [22:27:51] 6operations, 6Discovery, 10Maps: Capitalise "maps Cluster codfw" ganglia group - https://phabricator.wikimedia.org/T116234#1744117 (10Reedy) 3NEW [22:30:29] (03PS1) 10Dzahn: wmflabs.org - add ssl cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247926 [22:31:42] bd808: https://gerrit.wikimedia.org/r/#/c/247925/ [22:31:47] (tested on krypton) [22:31:51] 6operations, 6Discovery, 10Maps: Capitalise "maps Cluster codfw" ganglia group - https://phabricator.wikimedia.org/T116234#1744144 (10Dzahn) interestingly it's: hieradata/common.yaml: name: "maps Cluster" and the "common" part should mean it's not different for codfw and eqiad. odd. [22:32:17] 6operations, 10hardware-requests: Site: 1 server hardware access request for initializing the codfw elasticsearch cluster. 
- https://phabricator.wikimedia.org/T116236#1744145 (10EBernhardson) 3NEW [22:32:53] (03PS2) 10Dzahn: wmflabs.org - add ssl cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247926 [22:33:05] (03PS2) 10Ori.livneh: grafana: automate the creation of the Anonymous user [puppet] - 10https://gerrit.wikimedia.org/r/247925 [22:33:24] (03CR) 10Dzahn: [C: 032] wmflabs.org - add ssl cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247926 (owner: 10Dzahn) [22:33:24] ori: hardcoded to your homedir? [22:33:37] ah, fixed now [22:33:37] bd808: yeah, fixed in ps2 [22:33:42] sorry :P [22:38:11] (03CR) 10BryanDavis: [C: 031] "Not tested, but the code looks good and the idea is much nicer than requiring manual setup for a new instance." [puppet] - 10https://gerrit.wikimedia.org/r/247925 (owner: 10Ori.livneh) [22:38:21] bd808: thanks! :) [22:38:28] (03PS3) 10Ori.livneh: grafana: automate the creation of the Anonymous user [puppet] - 10https://gerrit.wikimedia.org/r/247925 [22:38:44] (03CR) 10Ori.livneh: [C: 032 V: 032] grafana: automate the creation of the Anonymous user [puppet] - 10https://gerrit.wikimedia.org/r/247925 (owner: 10Ori.livneh) [22:44:55] 6operations, 6Analytics-Backlog, 5Patch-For-Review: erbium (logging) - useradd: group '30001' does not exist - https://phabricator.wikimedia.org/T115943#1744196 (10chasemp) 5Open>3Resolved This has been resolved with an exception allowing the file_mover UID [22:47:16] (03PS1) 10Dzahn: ldap: add ssl cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247929 [22:49:30] (03PS2) 10Dzahn: ldap: add ssl cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247929 [22:50:11] (03CR) 10Dzahn: [C: 032] ldap: add ssl cert expiry check [puppet] - 10https://gerrit.wikimedia.org/r/247929 (owner: 10Dzahn) [22:51:57] (03PS2) 10Rush: Remove trebuchet user from wikidev group [puppet] - 10https://gerrit.wikimedia.org/r/247721 (https://phabricator.wikimedia.org/T115760) (owner: 10Thcipriani) [22:52:05] (03CR) 
10Rush: [C: 032 V: 032] Remove trebuchet user from wikidev group [puppet] - 10https://gerrit.wikimedia.org/r/247721 (https://phabricator.wikimedia.org/T115760) (owner: 10Thcipriani) [22:53:22] (03PS1) 10Dzahn: ldap: top-scope variable, mini lint fix [puppet] - 10https://gerrit.wikimedia.org/r/247930 [22:53:27] (03CR) 10jenkins-bot: [V: 04-1] ldap: top-scope variable, mini lint fix [puppet] - 10https://gerrit.wikimedia.org/r/247930 (owner: 10Dzahn) [22:53:34] (03PS2) 10Dzahn: ldap: top-scope variable, mini lint fix [puppet] - 10https://gerrit.wikimedia.org/r/247930 [22:54:45] (03CR) 10Dzahn: [C: 032] "fixes wrong resource name and 1 x "top-scope variable being used without an explicit namespace"" [puppet] - 10https://gerrit.wikimedia.org/r/247930 (owner: 10Dzahn) [22:55:52] is there something wrong with the AQS LVS? [22:56:12] mutante: ^^^ (since you're on duty) [22:56:18] :P [22:56:30] not that i know of, monitoring looks ok [22:56:38] i think i saw some changes earlier though. looking [22:56:57] 6operations, 6Release-Engineering-Team, 5Patch-For-Review: deployment: user trebuchet gets added and removed from group wikidev on every puppet run - https://phabricator.wikimedia.org/T115760#1744246 (10chasemp) 5Open>3Resolved done uid=995(trebuchet) gid=10004(trebuchet) groups=10004(trebuchet) [22:57:16] mutante: ping fails to aqs.svc.eqiad.wmnet from restbase1001 [22:57:29] pick up the correct ip for it though [22:57:42] so maybe it's lvs acting up a bit? [23:00:04] RoanKattouw ostriches rmoen Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20151021T2300). [23:00:39] !log ori@tin Synchronized php-1.27.0-wmf.2/extensions/AbuseFilter/AbuseFilter.parser.php: Ad-hoc debug logging of AbuseFilter exceptions (duration: 00m 17s) [23:00:40] ebernhardson, would you mind doing swat today? 
[23:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:02:59] Krenair: sure i can do that [23:03:05] thanks [23:03:06] !log ori@tin Synchronized php-1.27.0-wmf.2/extensions/AbuseFilter/AbuseFilter.parser.php: Ad-hoc debug logging of AbuseFilter exceptions (duration: 00m 17s) [23:03:13] Unless ori is volunteering [23:03:16] Hmm, no ping from jouncebot for me? [23:03:32] Anyway, I'm here. [23:04:01] * Krenair kicks morebots [23:05:47] mobrovac: i dont find it in pybal config, i'm trying to find it in etcd with conftool now [23:05:52] (03CR) 10EBernhardson: [C: 032] Add throttle exception for eswiki at 2015-10-23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247872 (https://phabricator.wikimedia.org/T116183) (owner: 10Luke081515) [23:05:59] it see it just got introduced not long ago [23:06:00] (03Merged) 10jenkins-bot: Add throttle exception for eswiki at 2015-10-23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/247872 (https://phabricator.wikimedia.org/T116183) (owner: 10Luke081515) [23:07:19] mobrovac: i dont see it in ./conftool-data/ either.. sure it was already done? [23:07:50] besides the DNS entry it is maybe just not ready [23:08:35] mutante: i see it in the role - https://github.com/wikimedia/operations-puppet/blob/production/hieradata/role/common/aqs.yaml#L52 [23:08:41] so i assumed it's up and running [23:08:45] James_F: around to test patch? [23:08:58] !log ebernhardson@tin Synchronized wmf-config/throttle.php: Add throtle exception for eswiki (duration: 00m 18s) [23:09:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:09:14] mobrovac: so i know of 2 places where the config would be but besides that it got added to DNS i dont see any.. i asked [23:09:30] ebernhardson: Still. :-) [23:09:33] sweet [23:09:43] going out whenever jenkins is happy... 
[23:10:21] operations, Deployment-Systems, Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1744283 (chasemp)
[23:10:22] operations, Deployment-Systems: errors reported by "eventual_consistency_deployment_server_init" on new deploy server - https://phabricator.wikimedia.org/T99928#1744280 (chasemp) Open>Resolved a:chasemp this is no longer happening
[23:10:25] mutante: kk thnx, i'll file a task and bug akosiaris about it tmrrw
[23:11:01] mobrovac: perfect, thanks
[23:11:28] operations, Patch-For-Review: Ferm rules for tin/mira - https://phabricator.wikimedia.org/T113351#1744285 (chasemp) I see rules on mira but not tin. what's the intended status?
[23:12:24] ebernhardson: So, tomorrow? :-)
[23:13:20] James_F: no, if it takes that long i'm canceling swat :P
[23:15:21] * James_F grins.
[23:15:23] You're no fun.
[23:16:11] operations, Patch-For-Review: Ferm rules for tin/mira - https://phabricator.wikimedia.org/T113351#1744305 (Dzahn) >>! In T113351#1744285, @chasemp wrote: > I see rules on mira but not tin. what's the intended status? see comments by Moritz: //But we should coordinate before applying this on tin: Add it...
[23:16:53] !log ebernhardson@tin Synchronized php-1.27.0-wmf.3/resources/: mw.ForeignStructuredUpload: Rearrange messages to always display license name (duration: 00m 18s)
[23:16:55] James_F: wmf.3 ^^
[23:17:01] Yup.
[23:17:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:17:50] * James_F waits for cache.
[23:19:16] operations, Deployment-Systems, Patch-For-Review: install/deploy mira as codfw deployment server - https://phabricator.wikimedia.org/T95436#1744308 (Dzahn) so, additional confirmation that the rules on mira look like everything that is needed on tin would not hurt. or we could try deploying from mira,...
[23:19:24] operations, Patch-For-Review: install/setup/deploy server rhodium as puppetmaster (scaling out) - https://phabricator.wikimedia.org/T98173#1744309 (RobH) @akosiaris: Is there still a need for this system in this role, or can I reclaim into spares?
[23:19:42] ebernhardson: LGTM.
[23:20:02] operations, Patch-For-Review: Ferm rules for tin/mira - https://phabricator.wikimedia.org/T113351#1744310 (Dzahn) so, additional confirmation that the rules on mira look like everything that is needed on tin would not hurt. or we could try deploying from mira, one way or the other around
[23:20:51] still waiting on wmf.2 to merge
[23:21:17] operations, Analytics, Services: Set up LVS for AQS - https://phabricator.wikimedia.org/T116245#1744311 (mobrovac) NEW
[23:21:23] James_F: err..i read the deploy calendar wrong. the other patch is wmf.3 as well. so only the first is deployed, the second is still waiting on jenkins :(
[23:21:45] operations, Patch-For-Review: ssl expiry tracking in icinga - we don't monitor that many domains - https://phabricator.wikimedia.org/T114059#1744319 (Dzahn) planet: https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=1&host=en.planet.wikimedia.org labs / tools.wmflabs.org https://icinga.wikimedia...
[23:21:47] ebernhardson: Ha, that makes sense.
[23:22:42] operations, Analytics, Services: Set up LVS for AQS - https://phabricator.wikimedia.org/T116245#1744321 (mobrovac)
[23:23:52] operations, Patch-For-Review: ssl expiry tracking in icinga - we don't monitor that many domains - https://phabricator.wikimedia.org/T114059#1744322 (Dzahn) ok, all done, except these remnants. can you help me here? what's the status? ecc-star.wmfusercontent.org.crt labvirt-star.eqiad.wmnet.crt ldap-m...
[23:25:50] !log ebernhardson@tin Synchronized php-1.27.0-wmf.3/resources/: mw.ForeignStructuredUpload: Provide category suggestions from the right wiki (duration: 00m 17s)
[23:25:51] James_F: ^
[23:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:25:59] * James_F rechecks. :-)
[23:26:09] (PS1) Mobrovac: RESTBase: Set up MobileApps storage and AQS public API [puppet] - https://gerrit.wikimedia.org/r/247935 (https://phabricator.wikimedia.org/T114830)
[23:26:20] Puppet, Deployment-Systems, Release-Engineering-Team, Salt, Staging: provider => trebuchet doesn't work until manual 'git deploy start' on deployment-server - https://phabricator.wikimedia.org/T92978#1744327 (chasemp)
[23:27:44] operations, Discovery, Wikimedia-Logstash, Elasticsearch, Graphite: Deploy statsd plugin for production elasticsearch & logstash - https://phabricator.wikimedia.org/T90889#1744329 (chasemp) Open>Resolved a:chasemp I think we are fairly happy T111573 for now (and that can be extended)
[23:28:01] (CR) Mobrovac: [C: -1] "Superseded by I17ae36660ebb374e7062cd1e4ad4634ffddf66a7 which integrates some additional changes we need to make in the next config change" [puppet] - https://gerrit.wikimedia.org/r/245887 (https://phabricator.wikimedia.org/T114830) (owner: Milimetric)
[23:29:37] operations, Discovery, Elasticsearch: unattended elasticsearch restarts - https://phabricator.wikimedia.org/T89845#1744341 (chasemp) Open>Resolved I'm not sure what this ticket entails at this point. We know we have some issues with mass cluster update and I imagine those get dealt with in T109089...
[23:29:56] (CR) Mobrovac: [C: -1] "Blocked by T116245 , -1'ing till that is resolved."
[puppet] - https://gerrit.wikimedia.org/r/247935 (https://phabricator.wikimedia.org/T114830) (owner: Mobrovac)
[23:30:06] (Abandoned) Milimetric: Add a public endpoint for AQS [puppet] - https://gerrit.wikimedia.org/r/245887 (https://phabricator.wikimedia.org/T114830) (owner: Milimetric)
[23:32:30] !log ebernhardson@tin Synchronized php-1.27.0-wmf.3/extensions/CirrusSearch: Performance tweaks for corss-dc copy process (duration: 00m 18s)
[23:32:31] Puppet, operations, Patch-For-Review: Make Puppet repository pass lenient and strict lint checks - https://phabricator.wikimedia.org/T87132#1744351 (chasemp) it seems like this is working now...but the blocked by tasks are not closed?
[23:32:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:32:59] operations, Graphite, audits-data-retention: graphite-web logs are not rotated - https://phabricator.wikimedia.org/T86546#1744352 (chasemp)
[23:33:44] !log ebernhardson@tin Synchronized php-1.27.0-wmf.2/extensions/CirrusSearch: Performance tweaks for corss-dc copy process (duration: 00m 19s)
[23:33:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:33:53] all SWAT patches are now deploye
[23:33:55] d
[23:39:44] (PS1) Dzahn: openldap: add SSL cert expiry check [puppet] - https://gerrit.wikimedia.org/r/247936
[23:41:05] operations, Patch-For-Review: ssl expiry tracking in icinga - we don't monitor that many domains - https://phabricator.wikimedia.org/T114059#1744391 (Dzahn) skip ldap-mirror.
that should be this https://gerrit.wikimedia.org/r/#/c/247936/1/manifests/role/openldap.pp
[23:41:46] (PS2) Dzahn: openldap: add SSL cert expiry check [puppet] - https://gerrit.wikimedia.org/r/247936
[23:42:11] (CR) Dzahn: [C: 2] openldap: add SSL cert expiry check [puppet] - https://gerrit.wikimedia.org/r/247936 (owner: Dzahn)
[23:46:08] operations, Analytics, Discovery, EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1744407 (GWicke) NEW
[23:46:17] operations, Analytics, Discovery, EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1744407 (GWicke) p:Normal>High
[23:48:12] operations, Analytics-Kanban, Monitoring, Patch-For-Review: Overhaul reqstats - https://phabricator.wikimedia.org/T83580#1744422 (chasemp) @ottomata ping me tomorrow if you can, I'd like to give helping a whirl
[23:48:23] operations, Phabricator, audits-data-retention: Enable mod_remoteip on Phabricator and ensure logs follow retention guidelines - https://phabricator.wikimedia.org/T114014#1744423 (mmodell) There are now retention policy options that we can configure in the phabricator garbage collectors: https://phabri...
[23:49:54] operations, Analytics, Discovery, EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1744425 (GWicke)
[23:50:51] operations, Analytics, Discovery, EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1744407 (GWicke)
[23:51:19] operations, Privacy, audits-data-retention: Gerrit seemingly violates data retention guidelines - https://phabricator.wikimedia.org/T114395#1744429 (csteipp) Let's purge those
[23:52:17] operations, Analytics, Discovery, EventBus, and 6 others: Define edit related events for change propagation - https://phabricator.wikimedia.org/T116247#1744440 (GWicke)
[23:52:25] operations: Have sane syslog logging - https://phabricator.wikimedia.org/T82287#1744442 (chasemp) Open>Resolved a:chasemp I'm only closing this as it seems like it has no clear actionable except logstashing and general log management and that is now falling under #wikimedia-logstash
[23:52:44] operations, Analytics, Discovery, EventBus, and 6 others: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#1744446 (GWicke)
[23:52:52] operations, MediaWiki-extensions-TimedMediaHandler, hardware-requests: Assign 3 more servers to video scaler duty - https://phabricator.wikimedia.org/T114337#1744447 (RobH) I don't think the application cluster is too busy, and we should be able to snag one from each row and append into the range. I'...
[23:53:36] operations, Analytics, Discovery, EventBus, and 6 others: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#933968 (GWicke)
[23:57:29] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[23:57:51] :) mira activity
[23:58:05] heh...
that's probably been alerting since like july
[23:59:25] I'm running "SSH_AUTH_SOCK=/run/keyholder/proxy.sock rsync -ar mwdeploy@tin:/srv/mediawiki-staging /srv" there