[00:02:01] mutante: merging yo thang
[00:02:22] andrewbogott: thanks!
[00:02:30] MaxSem: thanks!
[00:02:34] YuviPanda: https://wikitech.wikimedia.org/w/api.php?action=query&list=novainstances&niproject=deployment-prep&niregion=eqiad&format=jsonfm :D
[00:02:46] legoktm: I see the merge as still underway… it's up and working already?
[00:02:58] yup
[00:03:17] Ah, there it finally returned
[00:09:58] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]
[00:12:58] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0
[00:13:38] PROBLEM - puppet last run on cp3009 is CRITICAL: CRITICAL: puppet fail
[00:19:08] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]
[00:28:26] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0
[00:29:59] PROBLEM - puppet last run on db2038 is CRITICAL: CRITICAL: Puppet has 2 failures
[00:32:10] RECOVERY - puppet last run on cp3009 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures
[00:47:03] !log ran updateArticleCount.php --wiki=ckbwiki (bug 71884)
[00:47:08] Logged the message, Master
[00:47:22] RECOVERY - puppet last run on db2038 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures
[00:48:10] legoktm: Should also be done for Wikidata
[00:48:19] but not sure the thing scales for 16M pages
[00:48:31] is Wikidata importing lots of things?
[00:48:49] legoktm: Nope, but the count is off anyway AFAIR
[00:48:52] by quite a bit
[00:48:53] hmm
[00:48:57] might as well
[00:49:09] edit counts is off by more than a million by now I think :S
[00:49:12] * count
[00:49:48] I'll probably try to get the edit count right before we hit 200M :P
[00:50:38] oh
[00:50:41] I'll use the other script then
[00:50:52] initSiteStats.php
[00:51:18] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 313 seconds
[00:51:19] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 318 seconds
[00:53:19] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -0 seconds
[00:53:20] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds
[00:53:26] !log running initSiteStats.php on wikidatawiki
[00:53:33] Logged the message, Master
[00:59:05] (PS2) Dzahn: Remove squid monitoring from torrus [puppet] - https://gerrit.wikimedia.org/r/164274 (owner: Hoo man)
[01:02:22] hm, I guess that this script isn't really going to work because it'll get out of date while it's running
[01:02:44] (PS3) Dzahn: Remove squid monitoring from torrus [puppet] - https://gerrit.wikimedia.org/r/164274 (owner: Hoo man)
[01:04:55] (CR) Dzahn: [C: 2] "removed the dependency on I9164f99c332036 , rebased" [puppet] - https://gerrit.wikimedia.org/r/164274 (owner: Hoo man)
[01:11:25] (CR) Dzahn: "deleted /etc/torrus/xmlconfig/squid.xml from netmon1001 like it said in the message" [puppet] - https://gerrit.wikimedia.org/r/164274 (owner: Hoo man)
[01:13:01] (PS3) Dzahn: Remove all references to pmtpa from role::cache [puppet] - https://gerrit.wikimedia.org/r/164273 (owner: Hoo man)
[01:13:53] (CR) Dzahn: "Hoo man, alright, i removed the dependency between the 2 patches and merged the other one, then rebased this one, so now it doesn't touch " [puppet] - https://gerrit.wikimedia.org/r/164273 (owner: Hoo man)
[01:15:09] (CR) 
Dzahn: [C: 2] fix typo in static yaml phab priority settings file [puppet] - https://gerrit.wikimedia.org/r/165911 (owner: Christopher Johnson (WMDE))
[01:19:22] (CR) Dzahn: "you say there is no Ganglia in labs, then what is http://ganglia.wmflabs.org/ ?" [puppet] - https://gerrit.wikimedia.org/r/165360 (owner: Yuvipanda)
[01:19:51] PROBLEM - ElasticSearch health check on elastic1017 is CRITICAL: CRITICAL - Could not connect to server 10.64.48.39
[01:19:58] (CR) Dzahn: [C: -1] Include ganglia in standard only for production [puppet] - https://gerrit.wikimedia.org/r/165360 (owner: Yuvipanda)
[01:20:03] (CR) Yuvipanda: "empty and dead? :)" [puppet] - https://gerrit.wikimedia.org/r/165360 (owner: Yuvipanda)
[01:20:24] mutante: it's empty and dead, and if we try to put any data in it it dies pretty quickly.
[01:20:32] labs instances aren't big enough for ganglia...
[01:20:42] and our ganglia code isn't structured in a way we can split it up...
[01:20:59] Krinkle wrote https://tools.wmflabs.org/nagf/?project=tools to give summary graphs from graphite...
[01:22:01] (CR) Yuvipanda: "It's empty and not dead now, but if we send data to it, it dies pretty quickly." [puppet] - https://gerrit.wikimedia.org/r/165360 (owner: Yuvipanda)
[01:22:37] <^d> ES health is ok, I'm doing work on elastic1017 and 1018.
[01:22:52] <^d> s/ES/Elasticsearch/
[01:25:14] (PS2) Dzahn: contint: Sikuli is no longer used anywhere [puppet] - https://gerrit.wikimedia.org/r/165204 (https://bugzilla.wikimedia.org/54393) (owner: Zfilipin)
[01:27:04] (CR) Dzahn: [C: 2] "i don't see this package anymore, neither on gallium nor on lanthanum" [puppet] - https://gerrit.wikimedia.org/r/165204 (https://bugzilla.wikimedia.org/54393) (owner: Zfilipin)
[01:30:02] (CR) Dzahn: [C: 1] Install logstash-contrib too (1 comment) [puppet] - https://gerrit.wikimedia.org/r/165503 (owner: Reedy)
[01:30:30] (PS1) Yuvipanda: nagios_common: Allow content to be set for command config [puppet] - https://gerrit.wikimedia.org/r/165951
[01:31:09] (CR) jenkins-bot: [V: -1] nagios_common: Allow content to be set for command config [puppet] - https://gerrit.wikimedia.org/r/165951 (owner: Yuvipanda)
[01:31:50] (PS2) Yuvipanda: nagios_common: Allow content to be set for command config [puppet] - https://gerrit.wikimedia.org/r/165951
[01:32:13] (CR) Dzahn: [C: 2] "ii logstash 1.4.2-1-2c0f5a1 An extensible logging pipeline" [puppet] - https://gerrit.wikimedia.org/r/165503 (owner: Reedy)
[01:34:07] * Deskana sounds the product manager alarm.
[01:34:36] So there was an outage today that caused a gigantic amount of users of the mobile apps to get an experience with no CSS.
[01:35:03] Is there monitoring to make sure that the appropriate people are alerted next time this happens?
[01:35:34] <^d> There's lots of monitoring. What should've been monitored that wasn't?
[01:35:54] * Deskana is not an engineer.
[01:36:02] Perhaps YuviPanda knows the answer to that.
[01:36:08] (CR) Dzahn: "i don't know how, but this was already installed anyways - so this is noop" [puppet] - https://gerrit.wikimedia.org/r/165503 (owner: Reedy)
[01:36:18] hmm
[01:36:26] well, one possibility (that happened last time) is bits outage
[01:36:32] that's already heavily monitored...
[01:36:34] YuviPanda: fwiw, i removed the -1
[01:36:51] Okay, so there is monitoring. Cool. :)
[01:36:57] <^d> I don't see anything about bits in today's SAL.
[01:36:58] Deskana: to protect against today's outage, we'd have to monitor by hitting the URL and checking if it doesn't have a particular string
[01:37:17] ^d: nope, not bits. MobileApp extension referred to a file in MobileFrontend which was moved, and hence that one module broke...
[01:37:35] Deskana: if you file a bug somewhere and assign it to me, I'll set up an appropriate alert in icinga for the current situation.
[01:37:55] YuviPanda: We've got patches for the apps that catch the specific error that happened today and avoid overwriting the CSS if it happens again, but that's suboptimal.
[01:38:29] Deskana: well, defence in depth, etc :)
[01:38:30] PROBLEM - puppet last run on mw1128 is CRITICAL: CRITICAL: Puppet has 1 failures
[01:38:56] Deskana: still, do file a bug :)
[01:38:59] and cc me...
[01:39:04] * Deskana notes that the spelling "defence" is used in en-gb but not en-us ;)
[01:39:06] YuviPanda: Which product/component should that go in?
[01:39:12] Deskana: hmm, mobile app, I think
[01:39:18] opsy stuff isn't well represented in bugzilla
[01:39:21] Deskana: and assign it to me...
[01:40:02] Well, Extension:MobileApp doesn't have a component so I'll stick it in Apps/General
[01:42:23] (PS2) Dzahn: Make gerrit set PATCH_TO_REVIEW status only in bugzilla [puppet] - https://gerrit.wikimedia.org/r/164881 (owner: QChris)
[01:42:26] YuviPanda: Done. Thank you. :)
[01:42:33] Deskana: yw. it does have a component, though.
[01:42:37] mutante: thanks :)
[01:42:55] * Deskana stops the sounding of the product manager alarm.
[01:42:55] YuviPanda: Then I am blind.
[01:42:55] * Deskana checks if he is blind.
[01:43:12] Deskana: it's either in product mobile apps, or in mediawiki extensions product
[01:43:24] * Deskana facedesks.
[01:43:53] There.
[01:43:59] Thanks again!
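A minimal sketch of the kind of check described above: fetch a URL and alert on whether a particular string appears in the response. The function names and the `must_contain` switch are assumptions made for illustration, and the exit codes follow the usual Nagios/Icinga plugin convention; this is not the alert that was actually set up.

```python
import urllib.request

# Conventional Nagios/Icinga plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def evaluate(body, needle, must_contain=True):
    """Return (exit_code, message) for a response body.

    must_contain=True:  CRITICAL when `needle` is missing from the body.
    must_contain=False: CRITICAL when `needle` is present (e.g. an error marker).
    """
    found = needle in body
    if found == must_contain:
        return OK, "OK: content check passed"
    state = "missing" if must_contain else "present"
    return CRITICAL, "CRITICAL: string %r is %s" % (needle, state)


def check_url(url, needle, must_contain=True, timeout=10):
    """Fetch `url` and evaluate its body; fetch errors map to UNKNOWN."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        return UNKNOWN, "UNKNOWN: could not fetch %s: %s" % (url, exc)
    return evaluate(body, needle, must_contain)
```

A wrapper script would call `check_url(url, needle)`, print the message, and exit with the returned code so that the monitoring system can pick up the state.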
[01:44:55] (CR) Dzahn: [C: -1] "sync-apache can be removed, but apache-fast-test is still a useful test script that should stay on tin/terbium imho" [puppet] - https://gerrit.wikimedia.org/r/164508 (owner: Ori.livneh)
[01:45:36] (CR) Dzahn: [C: 2] Make gerrit set PATCH_TO_REVIEW status only in bugzilla [puppet] - https://gerrit.wikimedia.org/r/164881 (owner: QChris)
[01:47:56] (PS1) Yuvipanda: icinga: Separate smtp configuration into own file [puppet] - https://gerrit.wikimedia.org/r/165955
[01:47:58] (PS1) Yuvipanda: icinga: Split mysql related config entries into separate file [puppet] - https://gerrit.wikimedia.org/r/165956
[01:48:00] (PS1) Yuvipanda: nagios_common: Move checkcommands.cfg into module, from icinga [puppet] - https://gerrit.wikimedia.org/r/165957
[01:53:26] (CR) Dzahn: [C: 1] "that's correct, 'relay' doesn't exist in DNS, 'mail' does" [puppet] - https://gerrit.wikimedia.org/r/164716 (https://bugzilla.wikimedia.org/71634) (owner: Tim Landscheidt)
[01:53:33] (PS2) Dzahn: Tools: Fix hostname in EHLO [puppet] - https://gerrit.wikimedia.org/r/164716 (https://bugzilla.wikimedia.org/71634) (owner: Tim Landscheidt)
[01:53:40] (CR) Dzahn: [C: 2] Tools: Fix hostname in EHLO [puppet] - https://gerrit.wikimedia.org/r/164716 (https://bugzilla.wikimedia.org/71634) (owner: Tim Landscheidt)
[01:57:01] (CR) Dzahn: [C: -1] "what Jan said, we can add placeholder files or self-signed certs to labs/private to work around this, that way we can still test in labs" [puppet] - https://gerrit.wikimedia.org/r/164103 (owner: Yuvipanda)
[01:57:09] (PS1) Yuvipanda: nagios_common: Add 'none' time period [puppet] - https://gerrit.wikimedia.org/r/165959
[01:57:30] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]
[01:58:09] RECOVERY - puppet last run on mw1128 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures
[02:00:27] (PS2) Yuvipanda: nagios_common: Add 'none' time period [puppet] - https://gerrit.wikimedia.org/r/165959
[02:01:45] (CR) Dzahn: "what will this be used for?" [puppet] - https://gerrit.wikimedia.org/r/165959 (owner: Yuvipanda)
[02:02:27] (PS3) Yuvipanda: nagios_common: Add 'none' time period [puppet] - https://gerrit.wikimedia.org/r/165959
[02:02:37] (CR) Yuvipanda: "updated commit message to clarify." [puppet] - https://gerrit.wikimedia.org/r/165959 (owner: Yuvipanda)
[02:02:44] mutante: updated commit message to clarify
[02:03:45] Deskana: looking at the patch, it looks like what MobileApp is doing is wrong.
[02:04:30] legoktm: wanna patch? :D
[02:04:33] legoktm: In what sense?
[02:04:49] Deskana: it's referencing paths in an extension it does not control...
[02:05:09] Deskana: well, the comment says "// FIXME: This is crazy. Don't do this. Use $wgResourceLoaderLESSImportPaths."
[02:05:10] Deskana: when I wrote it, I got Juliusz to promise to tell me when the paths change...
[02:05:15] * Deskana leaves the engineers to fix the problem :)
[02:05:17] legoktm: well, Jon just added that today
[02:05:20] right
[02:05:26] so the fix isn't monitoring
[02:05:31] the fix is to write it properly
[02:05:35] legoktm: true :D
[02:05:46] easy fix: make MF commits run MobileApp tests
[02:06:00] what is wgResourceLoaderLESSImportPaths
[02:06:15] but then you get into a dependency loop, because MF won't pass until you fix MobileApp, and MobileApp won't pass until you update MF
[02:06:35] $wgResourceLoaderLESSImportPaths = array(
[02:06:35] "$IP/resources/src/mediawiki.less/",
[02:06:35] );
[02:06:48] hm
[02:06:59] > Extensions need not (and should not) register paths in $wgResourceLoaderLESSImportPaths.
[02:07:06] (CR) Dzahn: [C: 2] "icinga doesn't appear to hate it in config. 
almost expected it to complain but looks ok" [puppet] - https://gerrit.wikimedia.org/r/165959 (owner: Yuvipanda)
[02:07:10] legoktm: Patches welcome. :)
[02:07:30] And on that cheerful note, I'm off home.
[02:07:46] Deskana: my patches to that extension have been sitting for over a month.
[02:07:51] two months today even!
[02:08:00] legoktm: oh, which ones?
[02:08:01] Ping... Yuvi! :D
[02:08:04] In a bit.
[02:08:04] https://gerrit.wikimedia.org/r/#/c/152838/
[02:08:12] cba to rebase though.
[02:08:43] (CR) Dzahn: [C: 2] icinga: Split mysql related config entries into separate file [puppet] - https://gerrit.wikimedia.org/r/165956 (owner: Yuvipanda)
[02:10:02] YuviPanda: easiest fix, in MF's resources file add a // NOTE: These files are also used by E:MobileApp, remember to update that if you change anything
[02:10:23] (CR) Dzahn: [C: 2] icinga: Separate smtp configuration into own file [puppet] - https://gerrit.wikimedia.org/r/165955 (owner: Yuvipanda)
[02:10:48] legoktm: hah, true
[02:12:36] legoktm: did you test that patch?
[02:12:47] legoktm: usually by doing a curl and saving output before and after patch, and doing a diff
[02:12:48] at the time yes
[02:12:57] (CR) Dzahn: "true, role::wikitech would seem good, and make that include openstack and this (and maybe more)" [puppet] - https://gerrit.wikimedia.org/r/162619 (owner: Hoo man)
[02:13:10] RECOVERY - ElasticSearch health check on elastic1017 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 18: number_of_data_nodes: 18: active_primary_shards: 2029: active_shards: 6082: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[02:13:49] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0
[02:13:53] mutante: can you do 'ack-grep notify-service-by-email' in /etc/icinga and tell me what you see?
[02:13:58] (CR) Dzahn: "Ori meanwhile suggested to delete this in Change-Id: I69150281fb08a9f97f . i don't know about this one then" [puppet] - https://gerrit.wikimedia.org/r/160953 (owner: Alexandros Kosiaris)
[02:14:33] (CR) Krinkle: "@Dzahn: afaik the web server is up for no reason, but there is no aggregator/collector to receive the data. The daemons on the clients (th" [puppet] - https://gerrit.wikimedia.org/r/165360 (owner: Yuvipanda)
[02:16:50] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]
[02:17:43] (CR) Dzahn: "Yuvi, how about the Android builds meanwhile? i remember discussion at Wikimania about doing this in toollabs at all or a separate build p" [puppet] - https://gerrit.wikimedia.org/r/153600 (owner: Yuvipanda)
[02:17:59] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0
[02:18:08] YuviPanda: nothing
[02:18:16] mutante: hmm, at all?
[02:18:19] that's... werid
[02:18:21] *weird
[02:18:36] commands.cfg:# 'notify-service-by-email' command definition
[02:18:36] commands.cfg: command_name notify-service-by-email
[02:18:36] objects/contacts_icinga.cfg: service_notification_commands notify-service-by-email
[02:19:08] mutante: aha, objects/contacts_icinga.cfg!
[02:19:12] is... not... in... puppet
[02:19:13] goddamnit
[02:19:41] YuviPanda: there is just "root" in there
[02:19:54] and the email is root@localhost
[02:20:10] well, and one contactgroup called "admins" with only 1 member
[02:20:12] that root user
[02:20:23] mutante: hmm, in contacts.cfg in private repo, what's the value for service_notification_commands for any user?
[02:20:51] (CR) Yuvipanda: "They're currently running in toollabs. 
https://tools.wmflabs.org/wikipedia-android-builds/" [puppet] - https://gerrit.wikimedia.org/r/153600 (owner: Yuvipanda)
[02:21:29] PROBLEM - puppet last run on mw1083 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:21:53] YuviPanda: they are different, depending if it's just email, or IRC bot, or ops person who gets pages
[02:22:04] mutante: what's it for just email?
[02:22:14] notify-by-email
[02:22:28] hmm
[02:22:28] ok
[02:23:07] (CR) Dzahn: "how do you compile without JDK? manual install?" [puppet] - https://gerrit.wikimedia.org/r/153600 (owner: Yuvipanda)
[02:23:53] mutante: nope, I ran a dsh installing jdk, assuming the patch will be merged imminently...
[02:23:59] clearly that hasn't happened...
[02:24:03] I need to abstract this thing out.
[02:24:43] eh, that's what i meant by manual, basically :)
[02:24:49] via dsh or not
[02:25:03] ah, yesh
[02:25:05] manual install :)
[02:25:50] (PS1) Yuvipanda: shinken: Make shinken read new commands & notifications defs [puppet] - https://gerrit.wikimedia.org/r/165961
[02:27:10] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[02:29:00] (PS2) Yuvipanda: androidsdk: Make sure that JDK is present [puppet] - https://gerrit.wikimedia.org/r/153600
[02:29:09] mutante: ^ fixed now
[02:30:41] (CR) Dzahn: [C: 2] androidsdk: Make sure that JDK is present [puppet] - https://gerrit.wikimedia.org/r/153600 (owner: Yuvipanda)
[02:30:44] YuviPanda: thanks :)
[02:31:09] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[02:32:33] mutante: btw, running puppet on tools-dev to verify
[02:33:17] mutante: run completed without any issues :)
[02:34:18] YuviPanda: why u no sleep !?
[02:34:29] bd808: tried and failed.
[02:34:41] bd808: turn and twist in bed for 1.5h, no use...
[02:34:55] bd808: at least I fixed all the config issues with shinken and it restarts properly now
[02:35:14] and a lot of things required to write a 'shingen' are in place...
[02:35:17] YuviPanda: nice
[02:35:18] although I probably shouldn't call it that.
[02:35:45] yuvimontor?
[02:35:56] mutante: yeah, legoktm wrote / merged a wikitech API to list all projects, and then list instances in a project...
[02:35:59] much saner than SMW
[02:37:14] bd808: haha... no :P
[02:37:22] nobody loves naggen...
[02:37:37] I'll have to figure out some way of 'opting in' to monitoring...
[02:37:47] and first step would probably be 'up' monitoring, which just does check_ping
[02:38:21] no monitor_service here
[02:38:24] probably hiera...
[02:38:48] RECOVERY - puppet last run on mw1083 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[02:38:57] once I've ping monitoring in place, I'll write something up and email ops@
[02:39:24] !log LocalisationUpdate completed (1.25wmf2) at 2014-10-10 02:39:23+00:00
[02:39:33] Logged the message, Master
[02:42:28] * YuviPanda goes to try sleeping again
[02:42:31] ty, mutante!
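A rough sketch of the generator idea discussed above: query the wikitech API for a project's instances, then emit a ping-only "up" check per instance. The endpoint and query parameters are the ones quoted earlier in this log, with `format=json` assumed in place of `jsonfm`. The function names and the service stanza layout are hypothetical, and the shape of the API response is not shown in the log, so the sketch takes a plain list of hostnames rather than parsing a response.

```python
from urllib.parse import urlencode

# Endpoint quoted earlier in the log.
API = "https://wikitech.wikimedia.org/w/api.php"


def instances_url(project, region="eqiad"):
    """Build the novainstances query URL for one project (JSON output assumed)."""
    params = [
        ("action", "query"),
        ("list", "novainstances"),
        ("niproject", project),
        ("niregion", region),
        ("format", "json"),
    ]
    return API + "?" + urlencode(params)


def ping_service(hostname):
    """Render a Nagios-style service stanza that only checks reachability."""
    return (
        "define service {\n"
        "    host_name            %s\n"
        "    service_description  ping\n"
        "    check_command        check_ping\n"
        "}\n" % hostname
    )


def render(hostnames):
    """Render one stanza per hostname, sorted so the output is stable."""
    return "\n".join(ping_service(h) for h in sorted(hostnames))
```

Sorting the hostnames keeps the generated file stable between runs, which makes diffs of the monitoring config easier to read.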
[02:53:39] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:08:59] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:09:28] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:10:59] RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[03:12:51] !log LocalisationUpdate completed (1.25wmf3) at 2014-10-10 03:12:51+00:00
[03:12:56] Logged the message, Master
[03:26:19] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures
[03:26:48] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[03:44:33] !log enable purging of old eventlogging data from specific tables on m2-master, as per analytics@ discussion
[03:44:43] Logged the message, Master
[03:47:16] (PS1) Springle: Events for m2-master [software] - https://gerrit.wikimedia.org/r/165966
[03:48:43] (CR) Springle: [C: 2] Events for m2-master [software] - https://gerrit.wikimedia.org/r/165966 (owner: Springle)
[03:52:58] springle: noooooooooooo, every byte is sacred! every byte is great! if a byte is wasted, god gets quite irate!
[03:53:08] just kidding, thanks for driving that discussion
[03:53:22] :D
[03:54:11] one day we'll get serious about big data, like airbnb: http://mesos.apache.org/ "At Airbnb, we're using Mesos to manage cluster resources for most of our data infrastructure. We run Chronos, Storm, and Hadoop on top of Mesos in order to process petabytes of data." Petabytes!
[03:54:54] Full click stream tracking for all the 'pedias would be huge data
[03:56:02] the human genome project fits in 8gb
[03:56:13] but airbnb is processing petabytes
[03:56:43] ebay has >500 (maybe >1000 now) nodes in their hadoop cluster
[03:57:11] "Consistent with GDB's historical focus on mapping, the main classes of data in the database are maps, genes, amplimers, clones and polymorphisms, as well as supporting data such as references. The total database size is now upwards of 8 gigabytes. " (http://nar.oxfordjournals.org/content/26/1/94.full)
[03:58:04] fortunately if you accumulate enough data it becomes sentient
[03:58:05] Sure, but can it help you upsell faster shipping?
[03:58:30] the human genome project did very little a/b testing, 'tis true
[03:59:08] Found latest slides on ebay hadoop, 10,000 nodes; 150 PB
[03:59:28] http://www.slideshare.net/Hadoop_Summit/hadoop2-ebay
[04:00:20] ebay has been around since 1995, at least
[04:00:34] and they own paypal
[04:01:05] paypal has separate network ops and hardware. this is just for selling pez collectables
[04:01:54] 5 - 7 TB of logs ingested per day
[04:03:12] that's not bad
[04:03:37] airbnb has had approximately 16 million bookings since they launched, according to wikipedia
[04:03:57] since 'petabytes' is plural let's assume a modest two petabytes
[04:04:18] that's a cool 125 megabytes per transaction
[04:05:07] maybe they have one of those multifunction office copier/scanner things that generates huge TIFFs
[04:05:42] Or just a lot of page views per conversion
[04:06:00] and js based mouse movement heatmaps
[04:06:33] hey, i found some pdfs on wikipedia where the pdf is 190MB but if you split it into pages there are 300 pages of 188MB each. maybe their office copier/scanner generates output like that.
[04:06:49] and they are making scanned copies of every posting on their site
[04:06:52] zip bomb!
[04:07:21] cscott: how does that work?
[04:08:11] ori: i'm not sure, but i think that all the useful data is in the (common) prologue, and then each page is basically just (1 \renderpage), (2 \renderpage) or something like that.
[04:08:24] so pdfseparate dutifully copies the "prologue" to each page
[04:10:07] oh, charming
[04:15:34] so, i've got another ocg deploy ready to go, i was waiting to see if arlolra would come online tonight to walk him through a late night deploy
[04:16:21] but he apparently has a reasonable work-life balance. so maybe i should just deploy it myself. it's not yet friday in sfo. ;)
[04:19:49] !log upgrade db1046 mariadb 10
[04:19:55] Logged the message, Master
[04:22:47] <^d> !log elasticsearch upgrade from 1.3.2 -> 1.3.4 complete for all 18 nodes. Sporadic icinga warnings about health should go away now
[04:22:53] Logged the message, Master
[04:25:21] (PS1) Springle: Prepare db1046 for upgrade, switch to production MariaDB 10 config. [puppet] - https://gerrit.wikimedia.org/r/165970
[04:26:49] (CR) Springle: [C: 2] Prepare db1046 for upgrade, switch to production MariaDB 10 config. [puppet] - https://gerrit.wikimedia.org/r/165970 (owner: Springle)
[04:36:28] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[04:38:38] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Oct 10 04:38:38 UTC 2014 (duration 38m 37s)
[04:38:46] Logged the message, Master
[04:42:38] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[04:52:19] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]
[05:04:35] Possible to add versioned dependency in Puppet for packages?
[05:05:08] for example: apertium (>= 3.3.0.56825-1)
[05:05:26] as we are not using default apertium from trusty.
[05:05:33] akosiaris: ^
[05:16:33] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0
[05:39:44] kart_: yes; you can specify a version number as the value for 'ensure'
[05:42:16] kart_: here's an example: https://github.com/wikimedia/operations-puppet/blob/production/modules/mysql_wmf/manifests/packages.pp#L36-56
[05:43:38] kart_: there's also an apt::pin resource: https://github.com/wikimedia/operations-puppet/blob/production/modules/memcached/manifests/init.pp#L20-24
[06:22:54] <_joe_> !log doing some load testing on HHVM (api)
[06:22:59] Logged the message, Master
[06:30:13] PROBLEM - puppet last run on db1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:43] PROBLEM - puppet last run on search1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:43] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:53] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:54] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:02] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:12] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:12] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:13] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:23] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:31:24] PROBLEM - puppet last run on search1018 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:45:03] RECOVERY - puppet last run on db1018 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:45:48] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is 
currently enabled, last run 26 seconds ago with 0 failures
[06:45:49] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[06:45:49] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures
[06:45:52] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:46:03] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures
[06:46:03] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[06:46:03] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:46:04] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[06:46:22] RECOVERY - puppet last run on search1018 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:46:33] RECOVERY - puppet last run on search1001 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures
[06:49:29] <_joe_> !log load testing done
[06:49:35] Logged the message, Master
[06:53:34] PROBLEM - puppet last run on es1002 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:53:41] ori: thanks!
[06:54:53] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:58:55] (PS4) KartikMistry: WIP: apertium service for Beta [puppet] - https://gerrit.wikimedia.org/r/165485
[07:11:03] RECOVERY - puppet last run on es1002 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[07:12:33] !log running migratePass0 across all wikis
[07:12:39] Logged the message, Master
[07:13:22] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures
[09:33:16] (PS1) Alexandros Kosiaris: WIP: Backup /var/lib/jenking/config.xml on gallium [puppet] - https://gerrit.wikimedia.org/r/165991
[09:33:39] hashar: ^
[09:40:33] (CR) Alexandros Kosiaris: [C: -1] "Do we have a replacement for apache-graceful-all ? I still find it useful and is documented in many places in wikitech" [puppet] - https://gerrit.wikimedia.org/r/164508 (owner: Ori.livneh)
[09:48:49] akosiaris: I feel lame :-D
[09:49:40] I did not quite understand why the file sets were in a common role and did not want to mess it up
[09:49:49] sounded like a global config to me that is applied on every host
[09:49:49] don't... The backup manifests could use a little cleanup
[09:50:55] akosiaris: should we have the fileset named contint-master and then add entries to the includes => array?
[09:51:05] or add a file set per file hierarchy ?
[09:52:02] probably the 1st. 
I somehow doubt it will be reusable throughout the rest of the cluster
[09:52:31] I am btw kind of worried about that /srv/ssd thing
[09:52:42] that you want /srv/ssd excluded but not
[09:52:55] /srv/ssd/something/something/zuul/.git/refs
[09:57:17] yeah I just replied to your email
[09:57:26] /srv/ssd holds the job workspaces which we don't care about
[09:57:50] zuul/.git/refs are the references Zuul creates when merging a proposed patchset on the tip of the branch
[09:58:01] the resulting reference is then passed to Jenkins so it can fetch the merge commit
[09:58:18] they have one time usage, so we can lose them :]
[09:59:39] as for where to apply the backup::set {} it would probably be better done on the role class
[09:59:48] though they are applied on labs instance as well
[10:00:13] Zuul has two services: 'merger' and 'server'. We now have a role for each
[10:00:47] maybe it will just work on labs :]
[10:01:46] labs ?
[10:01:54] I kind of doubt it ...
[10:02:18] the labs instance got deleted anyway so I can look at it later on :]
[10:05:01] (CR) Hashar: WIP: Backup /var/lib/jenking/config.xml on gallium (2 comments) [puppet] - https://gerrit.wikimedia.org/r/165991 (owner: Alexandros Kosiaris)
[10:05:47] (PS1) Filippo Giunchedi: swift: enable container-server statsd reporting [puppet] - https://gerrit.wikimedia.org/r/165995
[10:42:49] (PS2) Filippo Giunchedi: swift: enable container-server statsd reporting [puppet] - https://gerrit.wikimedia.org/r/165995
[10:47:43] (CR) Filippo Giunchedi: [C: 2 V: 2] swift: enable container-server statsd reporting [puppet] - https://gerrit.wikimedia.org/r/165995 (owner: Filippo Giunchedi)
[10:49:35] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0; xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]
[11:00:13] (CR) Alexandros Kosiaris: [C: 2 V: 2] Add 
initial Debian packaging [debs/contenttranslation/apertium-es-ca] - 10https://gerrit.wikimedia.org/r/163578 (owner: 10KartikMistry) [11:00:48] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Added initial Debian packaging [debs/contenttranslation/apertium-es-pt] - 10https://gerrit.wikimedia.org/r/165473 (owner: 10KartikMistry) [11:06:52] (03PS1) 10Giuseppe Lavagetto: mediawiki: collect hhvm stats [puppet] - 10https://gerrit.wikimedia.org/r/166003 [11:07:36] (03CR) 10jenkins-bot: [V: 04-1] mediawiki: collect hhvm stats [puppet] - 10https://gerrit.wikimedia.org/r/166003 (owner: 10Giuseppe Lavagetto) [11:11:59] <_joe_> I want to kick our pep8 rules in the teeth [11:13:19] (03PS2) 10Giuseppe Lavagetto: mediawiki: collect hhvm stats [puppet] - 10https://gerrit.wikimedia.org/r/166003 [11:13:22] !log rolling restart of container-server on ms-be2* [11:13:32] Logged the message, Master [11:20:49] (03PS3) 10Giuseppe Lavagetto: mediawiki: collect hhvm stats [puppet] - 10https://gerrit.wikimedia.org/r/166003 [11:20:57] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] mediawiki: collect hhvm stats [puppet] - 10https://gerrit.wikimedia.org/r/166003 (owner: 10Giuseppe Lavagetto) [11:22:12] !log rolling restart of container-server on ms-be1* [11:22:18] Logged the message, Master [11:24:13] PROBLEM - puppet last run on mw1018 is CRITICAL: CRITICAL: puppet fail [11:25:05] <_joe_> ^^ me, damn typo [11:25:22] (03PS1) 10Giuseppe Lavagetto: mediawiki: fix typo in resource dependency [puppet] - 10https://gerrit.wikimedia.org/r/166004 [11:25:44] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] mediawiki: fix typo in resource dependency [puppet] - 10https://gerrit.wikimedia.org/r/166004 (owner: 10Giuseppe Lavagetto) [11:27:16] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: puppet fail [11:27:26] RECOVERY - puppet last run on mw1018 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures [11:36:25] (03PS1) 10Manybubbles: Turn on building Cirrus index for faster 
regexes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166005 [11:41:06] (03PS1) 10Giuseppe Lavagetto: mediawiki: fix typo in python script [puppet] - 10https://gerrit.wikimedia.org/r/166007 [11:41:38] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: fix typo in python script [puppet] - 10https://gerrit.wikimedia.org/r/166007 (owner: 10Giuseppe Lavagetto) [11:46:56] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [11:47:45] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0 [11:51:45] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave] [11:57:56] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0 [12:00:58] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 220, down: 1, dormant: 0, excluded: 0, unused: 0 xe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave] [12:02:56] PROBLEM - puppet last run on ms-be2002 is CRITICAL: CRITICAL: puppet fail [12:03:45] PROBLEM - puppet last run on ms-be1013 is CRITICAL: CRITICAL: puppet fail [12:11:43] (03PS3) 10KartikMistry: WIP: Added initial Debian package [debs/contenttranslation/apertium-apy] - 10https://gerrit.wikimedia.org/r/165528 [12:21:06] RECOVERY - puppet last run on ms-be1013 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [12:22:26] RECOVERY - puppet last run on ms-be2002 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [12:39:54] (03PS1) 10Hashar: contint: switch Zuul conf to new repository [puppet] - 10https://gerrit.wikimedia.org/r/166012 [12:40:48] 
I could use a git repo renaming for Zuul configuration https://gerrit.wikimedia.org/r/166012 :D [12:42:41] Coren: wanna merge some shinken patches for me? :) https://gerrit.wikimedia.org/r/#/c/165951/ and then follow the dependency chain... [12:42:52] well, some of them affect icinga too, so would need a run on neon... [12:43:21] YuviPanda: Lemme finish waking up and catching up to my email and I'll take a look. [12:43:24] Coren: ok! [12:43:53] Coren: btw, what's your TZ? [12:44:02] UTC-4 atm [12:44:11] ah, cool [12:44:25] * YuviPanda should write a small application to calculate his current timezone. [12:45:53] (03CR) 10Hashar: [C: 031] "I have manually adjusted the URL and git::clone does not complain :]" [puppet] - 10https://gerrit.wikimedia.org/r/166012 (owner: 10Hashar) [12:46:47] YuviPanda: afaict, your timezone is easy to calculate: (rand()%23)-12 [12:46:53] haha :D [12:46:58] yeah, woke up at 5pm today... [12:48:23] To be fair, my own effective timezone is sorta UTC-4±1 :-) [12:50:33] Coren: heh, I seem to drift somewhere between mid-atlantic and west-of-san-francisco... [12:53:19] (03CR) 10Alexandros Kosiaris: [C: 031] swift-synctool: enable/disable/show sync [software] - 10https://gerrit.wikimedia.org/r/160428 (owner: 10Filippo Giunchedi) [13:00:41] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Yes you are right of course Daniel about the default_sites. So ,patch +2 apart from the default_sites." [puppet] - 10https://gerrit.wikimedia.org/r/164498 (owner: 10Dzahn) [13:01:37] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 222, down: 0, dormant: 0, excluded: 0, unused: 0 [13:03:46] (03CR) 10Alexandros Kosiaris: [C: 032] rancid - remove pmtpa devices from router.db [puppet] - 10https://gerrit.wikimedia.org/r/165674 (owner: 10Dzahn) [13:18:41] (03CR) 10Chad: [C: 031] "Fire at will commander." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/166005 (owner: 10Manybubbles) [13:19:06] any objection to me deploying a super safe config change? [13:19:30] i only have objections when it turns out not to be super safe ;-) [13:20:14] <^d> heh [13:20:27] (03PS6) 10Filippo Giunchedi: swift-synctool: enable/disable/show sync [software] - 10https://gerrit.wikimedia.org/r/160428 [13:20:34] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] swift-synctool: enable/disable/show sync [software] - 10https://gerrit.wikimedia.org/r/160428 (owner: 10Filippo Giunchedi) [13:23:50] (03PS2) 10Alexandros Kosiaris: remove Tampa nas servers [dns] - 10https://gerrit.wikimedia.org/r/165412 (owner: 10Dzahn) [13:24:40] nice:) [13:24:51] I promise [13:25:52] akosiaris: thanks for the feedback on 160428 ! btw I got around to the 127.0.1.1 stopga^Wfix what do you think? [13:27:25] (03CR) 10Chad: First of (hopefully many) es-tool commands (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/163945 (owner: 10Chad) [13:27:51] godog: kind of weird way of doing it. So it depends on exec failing so that puppet will report one failure and we detect it via icinga ? 
[13:28:13] cause it will probably not be detected by the human eye on very first run (or second for that) [13:28:14] (03CR) 10Manybubbles: [C: 032] Turn on building Cirrus index for faster regexes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166005 (owner: 10Manybubbles) [13:28:26] (03Merged) 10jenkins-bot: Turn on building Cirrus index for faster regexes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166005 (owner: 10Manybubbles) [13:28:38] (03CR) 10Alexandros Kosiaris: [C: 032] smtp service -> polonium, remove imap service [dns] - 10https://gerrit.wikimedia.org/r/164262 (owner: 10Dzahn) [13:29:16] !log manybubbles Synchronized wmf-config/CirrusSearch-common.php: Add configuration so cirrus can build an index to speed up regex searches (duration: 00m 04s) [13:29:21] Logged the message, Master [13:29:33] (03CR) 10Alexandros Kosiaris: [C: 032] remove Tampa nas servers [dns] - 10https://gerrit.wikimedia.org/r/165412 (owner: 10Dzahn) [13:29:38] akosiaris: yeah, "puppet not running" is a condition we should be aware of anyway, however if you can think of another way let me know! [13:30:04] will it be puppet not running ? [13:30:19] I think it will be just a failed resource, right ? [13:31:03] we've monitoring for failures too, right? [13:31:12] (03CR) 10Mark Bergsma: "The smtp service name was meant for authenticated SMTP submission, which only sanger did. polonium is not a replacement for that. It might" [dns] - 10https://gerrit.wikimedia.org/r/164262 (owner: 10Dzahn) [13:31:13] yes we do [13:31:23] * YuviPanda wonders if we should turn on puppet metrics graphing in prod too - we have some in labs [13:31:53] most useful would probably be the puppet runtime across machines, but unsure how useful that would be... [13:31:56] mark: well smokeping was using smtp.wikimedia.org [13:32:09] hm ok [13:32:17] by accident I think ;) [13:32:27] so it was failing all this time ? [13:32:53] niah it wasn't .... 
[13:33:55] and it was not going through sanger anyway [13:34:05] akosiaris: good question, I don't think it should block puppet running, no [13:34:22] so something was fishy in that config anyway... investigating [13:42:12] (03PS6) 10Chad: Adding tools for banning/unbanning an ES node [puppet] - 10https://gerrit.wikimedia.org/r/164617 [13:42:14] (03PS8) 10Chad: First of (hopefully many) es-tool commands [puppet] - 10https://gerrit.wikimedia.org/r/163945 [13:42:16] (03PS6) 10Chad: Another es-tool function: restart a node the fast & easy way [puppet] - 10https://gerrit.wikimedia.org/r/164401 [13:42:18] (03PS7) 10Chad: More elasticsearch tools [puppet] - 10https://gerrit.wikimedia.org/r/164270 [13:58:07] (03PS1) 10Giuseppe Lavagetto: diamond: correct collector class name [puppet] - 10https://gerrit.wikimedia.org/r/166037 [14:01:14] <^d> godog: That latest update to the es-tool changes ^ were some stylistic changes requested by or.i [14:01:26] <^d> No functional changes. [14:02:06] ^d: ack, I'll take a look [14:05:03] (03CR) 10Filippo Giunchedi: [C: 031] First of (hopefully many) es-tool commands [puppet] - 10https://gerrit.wikimedia.org/r/163945 (owner: 10Chad) [14:05:40] (03CR) 10Giuseppe Lavagetto: [C: 032] diamond: correct collector class name [puppet] - 10https://gerrit.wikimedia.org/r/166037 (owner: 10Giuseppe Lavagetto) [14:06:21] <_joe_> hashar: is zuul down atm? [14:06:31] ^d: looks good, I'll leave my shed full of trailing commas for another review [14:06:31] oh man [14:06:33] (03CR) 10Giuseppe Lavagetto: [V: 032] diamond: correct collector class name [puppet] - 10https://gerrit.wikimedia.org/r/166037 (owner: 10Giuseppe Lavagetto) [14:06:38] it went wild again [14:07:27] !log Disconnecting / reconnecting Jenkins/Zuul gearman as per https://bugzilla.wikimedia.org/show_bug.cgi?id=63760#c12 [14:07:35] Logged the message, Master [14:08:15] <^d> godog: Thx. 
I'm getting tired of rebasing that full chain :p [14:10:53] _joe_: fixed [14:11:45] * _joe_ kicks himself [14:12:50] (03PS1) 10Giuseppe Lavagetto: diamond: fix collector [puppet] - 10https://gerrit.wikimedia.org/r/166040 [14:13:49] (03CR) 10Giuseppe Lavagetto: [C: 032] diamond: fix collector [puppet] - 10https://gerrit.wikimedia.org/r/166040 (owner: 10Giuseppe Lavagetto) [14:14:13] could use a git::cloner change for contint, I have renamed a git repository https://gerrit.wikimedia.org/r/#/c/166012/ [14:14:20] already applied manually [14:14:37] (03PS1) 10Alexandros Kosiaris: Remove mailhost setting from smokeping [puppet] - 10https://gerrit.wikimedia.org/r/166042 [14:17:58] (03PS1) 10Alexandros Kosiaris: Remove smtp.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/166043 [14:18:23] mark: https://gerrit.wikimedia.org/r/#/c/166042/1 [14:18:33] fallback of a fallback of a fallback [14:19:46] (03PS1) 10RobH: codfw es servers production ip allocation [dns] - 10https://gerrit.wikimedia.org/r/166044 [14:20:27] (03PS2) 10Alexandros Kosiaris: Remove mailhost setting from smokeping [puppet] - 10https://gerrit.wikimedia.org/r/166042 [14:21:42] akosiaris: haha, I wanted to do the same [14:21:58] nice :) [14:22:21] (03CR) 10Faidon Liambotis: [C: 032] Remove mailhost setting from smokeping [puppet] - 10https://gerrit.wikimedia.org/r/166042 (owner: 10Alexandros Kosiaris) [14:22:39] <_joe_> oh well, after all it's diamond's fault if my collector doesn't work [14:22:43] I saw daniel's change, said "mhhm smtp.wm.org, I should kill that" [14:22:55] (03CR) 10Alexandros Kosiaris: "mark's comment was addressed in 9956be090a68af. The underlying cause for the bad merge of this was addressed in 07f985a62" [dns] - 10https://gerrit.wikimedia.org/r/164262 (owner: 10Dzahn) [14:23:03] <_joe_> diamond doesn't work with custom collectors on trusty, wtf? [14:23:51] paravoid: I thought about it. then git grep in puppet said hey "this is used!!!!" [14:24:13] and well... 
I assumed (hence the error) that it was indeed working.... [14:24:39] (03CR) 10RobH: [C: 032] codfw es servers production ip allocation [dns] - 10https://gerrit.wikimedia.org/r/166044 (owner: 10RobH) [14:24:49] _joe_: are you sure? I've a few custom collectors running on labs in trusty... [14:26:39] (03CR) 10Alexandros Kosiaris: [C: 032] Remove smtp.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/166043 (owner: 10Alexandros Kosiaris) [14:27:18] <_joe_> YuviPanda: mmmh the error I get is quite strange [14:27:29] robh: I merged the es changes as well [14:27:41] <_joe_> File "/usr/lib/pymodules/python2.7/diamond/collector.py", line 170, in __init__ [14:27:44] akosiaris: Thanks! [14:27:44] <_joe_> if isinstance(self.config['byte_unit'], basestring): [14:27:51] <_joe_> YuviPanda: ^^ [14:28:05] <_joe_> the config doesn't contain 'byte_unit' [14:28:12] that's set by the default config... [14:28:20] maybe you're not inheriting/overriding the config object properly..? [14:29:02] there's a get_default_config() that you override... 
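The override pattern suggested in the exchange above can be sketched roughly as follows. This is a minimal, self-contained illustration and not Diamond's actual code: `BaseCollector` is a hypothetical stand-in for `diamond.collector.Collector`, the default values and the `path` setting are assumptions, and it is written for Python 3 (so `str` replaces the `basestring` seen in the traceback). It only shows how a subclass that replaces the defaults, rather than extending them, could lose `byte_unit` and trip the quoted check; _joe_ notes right afterwards that this was not the cause in his case.

```python
class BaseCollector:
    """Hypothetical stand-in for diamond.collector.Collector."""

    def get_default_config(self):
        # Assumed defaults; the real base class defines more keys.
        return {"byte_unit": "byte", "enabled": False}

    def __init__(self):
        self.config = self.get_default_config()
        # Mirrors the check quoted from collector.py in the traceback.
        if isinstance(self.config["byte_unit"], str):
            self.config["byte_unit"] = self.config["byte_unit"].split()


class BrokenCollector(BaseCollector):
    def get_default_config(self):
        # Replaces the defaults wholesale, so 'byte_unit' disappears.
        return {"path": "hhvm"}


class FixedCollector(BaseCollector):
    def get_default_config(self):
        # Start from the inherited defaults, then add collector-specific keys.
        config = super().get_default_config()
        config.update({"path": "hhvm"})
        return config


if __name__ == "__main__":
    print(FixedCollector().config)
    try:
        BrokenCollector()
    except KeyError as missing:
        print("missing config key:", missing)
```

Calling `super().get_default_config()` first keeps whatever keys the base class expects, which is why the extended version survives the `byte_unit` check while the wholesale replacement does not.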
[14:29:07] <_joe_> class hhvm_healthCollector(diamond.collector.Collector): [14:29:22] <_joe_> YuviPanda: damn I found it [14:29:32] <_joe_> that wasn't the problem [14:31:31] (03PS1) 10Hashar: contint: ruby2.0 on Trusty slaves [puppet] - 10https://gerrit.wikimedia.org/r/166046 [14:32:06] (03PS1) 10Giuseppe Lavagetto: diamond: fix collector (again) [puppet] - 10https://gerrit.wikimedia.org/r/166047 [14:32:59] (03PS2) 10Giuseppe Lavagetto: diamond: fix collector (again) [puppet] - 10https://gerrit.wikimedia.org/r/166047 [14:33:07] (03CR) 10Hashar: "Cherry picked on contint puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/166046 (owner: 10Hashar) [14:33:44] (03CR) 10Giuseppe Lavagetto: [C: 032] diamond: fix collector (again) [puppet] - 10https://gerrit.wikimedia.org/r/166047 (owner: 10Giuseppe Lavagetto) [14:41:24] (03CR) 10Hashar: "let us run gem/bundler based jobs on Trusty instance https://gerrit.wikimedia.org/r/166049" [puppet] - 10https://gerrit.wikimedia.org/r/166046 (owner: 10Hashar) [14:50:13] (03PS3) 10Yuvipanda: nagios_common: Allow content to be set for command config [puppet] - 10https://gerrit.wikimedia.org/r/165951 [14:50:15] (03PS2) 10Yuvipanda: shinken: Make shinken read new commands & notifications defs [puppet] - 10https://gerrit.wikimedia.org/r/165961 [14:50:17] (03PS4) 10Yuvipanda: nagios_common: Add 'none' time period [puppet] - 10https://gerrit.wikimedia.org/r/165959 [14:50:19] (03PS2) 10Yuvipanda: icinga: Split mysql related config entries into separate file [puppet] - 10https://gerrit.wikimedia.org/r/165956 [14:50:21] (03PS2) 10Yuvipanda: nagios_common: Move checkcommands.cfg into module, from icinga [puppet] - 10https://gerrit.wikimedia.org/r/165957 [14:50:23] (03PS2) 10Yuvipanda: icinga: Separate smtp configuration into own file [puppet] - 10https://gerrit.wikimedia.org/r/165955 [15:00:39] andrewbogott: seems ok on shinken... 
[15:01:34] (03PS1) 10Yuvipanda: shinken: Use appropriate notification command [puppet] - 10https://gerrit.wikimedia.org/r/166053 [15:02:29] (03CR) 10Andrew Bogott: [C: 032] nagios_common: Allow content to be set for command config [puppet] - 10https://gerrit.wikimedia.org/r/165951 (owner: 10Yuvipanda) [15:14:27] andrewbogott: stepping out for food... [15:22:56] PROBLEM - Host mw1027 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:07] RECOVERY - Host mw1027 is UP: PING OK - Packet loss = 0%, RTA = 3.80 ms [15:30:07] PROBLEM - nutcracker process on mw1027 is CRITICAL: Connection refused by host [15:30:15] PROBLEM - SSH on mw1027 is CRITICAL: Connection refused [15:30:15] PROBLEM - Disk space on mw1027 is CRITICAL: Connection refused by host [15:30:16] PROBLEM - Apache HTTP on mw1027 is CRITICAL: Connection refused [15:30:26] PROBLEM - check if salt-minion is running on mw1027 is CRITICAL: Connection refused by host [15:30:37] PROBLEM - check if dhclient is running on mw1027 is CRITICAL: Connection refused by host [15:30:46] PROBLEM - puppet last run on mw1027 is CRITICAL: Connection refused by host [15:30:55] PROBLEM - DPKG on mw1027 is CRITICAL: Connection refused by host [15:30:59] PROBLEM - check configured eth on mw1027 is CRITICAL: Connection refused by host [15:31:00] PROBLEM - RAID on mw1027 is CRITICAL: Connection refused by host [15:31:00] PROBLEM - nutcracker port on mw1027 is CRITICAL: Connection refused by host [15:39:26] RECOVERY - SSH on mw1027 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [15:43:47] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Various comments here and there" (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/165485 (owner: 10KartikMistry) [15:50:38] PROBLEM - RAID on mw1027 is CRITICAL: Connection refused by host [15:51:08] PROBLEM - check configured eth on mw1027 is CRITICAL: Connection refused by host [15:51:10] PROBLEM - check if dhclient is running on mw1027 is CRITICAL: Connection refused by host 
[15:51:28] PROBLEM - check if salt-minion is running on mw1027 is CRITICAL: Connection refused by host [15:51:49] PROBLEM - nutcracker port on mw1027 is CRITICAL: Connection refused by host [15:51:49] PROBLEM - nutcracker process on mw1027 is CRITICAL: Connection refused by host [15:51:50] PROBLEM - DPKG on mw1027 is CRITICAL: Connection refused by host [15:52:08] PROBLEM - Disk space on mw1027 is CRITICAL: Connection refused by host [15:52:08] PROBLEM - puppet last run on mw1027 is CRITICAL: Connection refused by host [15:56:02] (03CR) 10Alexandros Kosiaris: [C: 04-1] base: add checks for 127.0.1.1 in /etc/hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/157795 (owner: 10Filippo Giunchedi) [15:57:44] (03CR) 10Alexandros Kosiaris: WIP: Backup /var/lib/jenking/config.xml on gallium (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/165991 (owner: 10Alexandros Kosiaris) [15:57:51] PROBLEM - Host ps1-b7-eqiad is DOWN: CRITICAL - Plugin timed out after 15 seconds [15:58:06] <_joe_> mmmh [15:58:09] RECOVERY - Host ps1-b7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.68 ms [15:58:13] <_joe_> is this a power line? 
[15:58:29] akosiaris: thanks will follow up on monday :) [15:58:48] _joe_: power strip [15:59:40] rack b7 [16:01:59] RECOVERY - RAID on mw1027 is OK: OK: no RAID installed [16:02:20] RECOVERY - nutcracker port on mw1027 is OK: TCP OK - 0.000 second response time on port 11212 [16:02:20] RECOVERY - nutcracker process on mw1027 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [16:02:20] RECOVERY - DPKG on mw1027 is OK: All packages OK [16:02:29] RECOVERY - check configured eth on mw1027 is OK: NRPE: Unable to read output [16:02:30] RECOVERY - Disk space on mw1027 is OK: DISK OK [16:02:39] RECOVERY - check if dhclient is running on mw1027 is OK: PROCS OK: 0 processes with command name dhclient [16:02:50] RECOVERY - check if salt-minion is running on mw1027 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:02:59] PROBLEM - NTP on mw1027 is CRITICAL: NTP CRITICAL: Offset unknown [16:03:39] PROBLEM - puppet last run on mw1027 is CRITICAL: CRITICAL: Puppet has 1 failures [16:04:30] RECOVERY - Disk space on labstore1001 is OK: DISK OK [16:11:19] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out [16:11:25] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection timed out [16:11:29] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out [16:11:29] PROBLEM - Apache HTTP on mw1153 is CRITICAL: Connection timed out [16:11:29] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection timed out [16:11:40] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: Connection timed out [16:12:00] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection timed out [16:12:39] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection timed out [16:12:59] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out [16:13:20] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 26.67% of data above the critical threshold [500.0] [16:14:29] RECOVERY - NTP on mw1027 is OK: NTP OK: 
Offset -0.03863227367 secs [16:14:42] <_joe_> mmmh [16:15:10] PROBLEM - very high load average likely xfs on ms-be1013 is CRITICAL: CRITICAL - load average: 105.36, 100.55, 86.01 [16:15:45] nice [16:17:25] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 1 below the confidence bounds [16:18:48] !log stopping swift on ms-be1013, debugging [16:18:50] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.109 second response time [16:18:56] Logged the message, Master [16:19:00] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.248 second response time [16:19:09] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.083 second response time [16:19:11] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.090 second response time [16:19:11] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.116 second response time [16:19:11] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.112 second response time [16:19:11] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.078 second response time [16:19:16] hm, 1014 & 1015 seem to have a similar I/O wait spike [16:19:21] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 71664 bytes in 0.405 second response time [16:19:26] but for quite a while now [16:19:49] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.119 second response time [16:19:54] ah these are the new boxes, despite the H710P, I guess we threw more I/O at them than they could handle [16:20:58] RECOVERY - very high load average likely xfs on ms-be1013 is OK: OK - load average: 11.43, 65.87, 77.94 [16:21:18] 
PROBLEM - swift-account-reaper on ms-be1013 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [16:21:19] PROBLEM - swift-container-updater on ms-be1013 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [16:21:26] yeah yeah we know [16:21:29] PROBLEM - swift-account-replicator on ms-be1013 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [16:21:29] PROBLEM - swift-account-auditor on ms-be1013 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [16:21:41] <_joe_> but icinga-wm will tell you anyways [16:21:48] PROBLEM - swift-account-server on ms-be1013 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [16:21:59] PROBLEM - swift-container-replicator on ms-be1013 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [16:21:59] PROBLEM - swift-object-auditor on ms-be1013 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [16:21:59] PROBLEM - swift-object-updater on ms-be1013 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [16:22:10] PROBLEM - swift-object-replicator on ms-be1013 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [16:22:10] PROBLEM - swift-object-server on ms-be1013 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [16:22:10] PROBLEM - swift-container-auditor on ms-be1013 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [16:22:10] PROBLEM - swift-container-server on ms-be1013 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python 
/usr/bin/swift-container-server [16:23:20] https://ganglia.wikimedia.org/latest/graph_all_periods.php?h=ms-be1013.eqiad.wmnet&m=cpu_report&r=week&s=by%20name&hc=4&mc=2&st=1412958183&g=load_report&z=large&c=Swift%20eqiad [16:23:24] this box isn't happy in general [16:25:20] ACKNOWLEDGEMENT - DPKG on analytics1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages ottomata kafkatee debugging happening here! [16:25:20] ACKNOWLEDGEMENT - check if salt-minion is running on analytics1003 is CRITICAL: NRPE: Command check_check_salt_minion not defined ottomata kafkatee debugging happening here! [16:26:48] RECOVERY - swift-account-server on ms-be1013 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [16:27:03] let's see [16:27:09] RECOVERY - swift-container-replicator on ms-be1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [16:27:18] RECOVERY - swift-object-auditor on ms-be1013 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [16:27:18] RECOVERY - swift-object-updater on ms-be1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [16:27:18] RECOVERY - swift-object-replicator on ms-be1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [16:27:18] RECOVERY - swift-object-server on ms-be1013 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [16:27:28] RECOVERY - swift-container-auditor on ms-be1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [16:27:28] RECOVERY - swift-container-server on ms-be1013 is OK: PROCS OK: 13 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [16:27:29] RECOVERY - swift-account-reaper on ms-be1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [16:27:38] RECOVERY - 
swift-container-updater on ms-be1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [16:27:39] RECOVERY - swift-account-auditor on ms-be1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [16:27:40] RECOVERY - swift-account-replicator on ms-be1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [16:29:08] running with a 40 avg 1-min load all the time [16:31:47] paravoid: indeed more io wait because of more weight [16:31:56] paravoid: was it the xfs bug btw? [16:32:00] no [16:32:11] paravoid: such a sad load average [16:32:26] just too much load [16:32:57] we have to do something about all that [16:34:39] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [16:34:47] not sure why it'd alarm on the rendering lvs tho [16:35:15] piled up requests on the imagescalers for one file that was there and waiting on I/O [16:35:26] it's not a great architecture [16:35:33] not anymore anyway [16:39:50] right, monday I'll rebalance down those disks a bit that should lessen the load [16:47:30] <^d> godog: What was that command line tool you were using to monitor incoming/outgoing http requests to a server? [16:49:39] RECOVERY - puppet last run on mw1027 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [16:51:49] ^d: httpry maybe? [16:52:54] <^d> Ah that was it [16:52:59] <^d> thx [16:54:47] :) [17:01:14] mutante: is https://rt.wikimedia.org/Ticket/Display.html?id=8617 something you can take care of? Or shall we encourage scott to wait for the phab changeover? [17:02:04] Reedy: is https://rt.wikimedia.org/Ticket/Display.html?id=8613 something you need an op for? Looks to me like a core/beta issue. 
[17:11:17] ottomata: http://www.amazon.com/Headset-Buddy-Adapter-Smartphone-01-PH35-PC35/dp/B00332DPDG/ref=sr_1_3?ie=UTF8&qid=1412960954&sr=8-3&keywords=headset+to+split+mic+headphone [17:11:41] hokay thank youuu [17:17:19] !log mw1027 rebuild complete (now with HHVM goodness in every bite!) [17:17:25] Logged the message, Master [17:30:43] (03PS2) 10Ottomata: Hadoop fairscheduler queue change - remove 'adhoc' queue, rename 'standard' to 'essential'. [puppet] - 10https://gerrit.wikimedia.org/r/165512 [17:30:54] (03CR) 10Ottomata: [C: 032 V: 032] Hadoop fairscheduler queue change - remove 'adhoc' queue, rename 'standard' to 'essential'. [puppet] - 10https://gerrit.wikimedia.org/r/165512 (owner: 10Ottomata) [17:35:28] PROBLEM - puppet last run on ms-fe3001 is CRITICAL: CRITICAL: Puppet has 1 failures [17:45:46] (03PS1) 10Chad: Phabricator: Ensure root directory is created before cloning in it [puppet] - 10https://gerrit.wikimedia.org/r/166067 [17:46:56] (03CR) 10Rush: [C: 032 V: 032] "thanks man was at the back of my mind for awhile and never got to do it" [puppet] - 10https://gerrit.wikimedia.org/r/166067 (owner: 10Chad) [17:48:38] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [17:48:47] (03PS1) 10Chad: T620: Timezone should not default to America/Los_Angeles [puppet] - 10https://gerrit.wikimedia.org/r/166068 [17:53:09] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [17:53:48] RECOVERY - puppet last run on ms-fe3001 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [17:57:20] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:57:39] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:57:39] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:57:39] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [17:57:39] PROBLEM - Apache HTTP on mw1159 is CRITICAL: Connection timed out [17:57:59] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:58:52] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:59:03] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection timed out [17:59:42] PROBLEM - Apache HTTP on mw1156 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:02:32] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.086 second response time [18:02:39] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.077 second response time [18:02:49] !log begin reimaging of mw1026 [18:02:54] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.061 second response time [18:02:54] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.083 second response time [18:02:55] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.077 second response time [18:02:55] Logged the message, Master [18:03:02] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.088 second response time [18:03:02] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.083 second response time [18:03:13] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.080 second response time [18:03:23] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 71664 bytes in 0.417 second response time [18:07:45] (03PS1) 10coren: mw102[67] to appserver_hhvm [puppet] - 10https://gerrit.wikimedia.org/r/166073 [18:08:45] (03CR) 10coren: [C: 032] "Trivial" [puppet] - 10https://gerrit.wikimedia.org/r/166073 (owner: 
10coren) [18:16:52] PROBLEM - puppet last run on mw1026 is CRITICAL: Timeout while attempting connection [18:18:13] PROBLEM - Host mw1026 is DOWN: PING CRITICAL - Packet loss = 100% [18:18:41] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [18:28:44] (03CR) 10Physikerwelt: "This woule enable I455b41c8b8d918f4c34f6c115194d227a8394e0a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/158559 (https://bugzilla.wikimedia.org/49169) (owner: 10Physikerwelt) [18:30:18] (03CR) 10Jforrester: [C: 031] T620: Timezone should not default to America/Los_Angeles [puppet] - 10https://gerrit.wikimedia.org/r/166068 (owner: 10Chad) [18:32:27] * YuviPanda gently pokes andrewbogott with https://gerrit.wikimedia.org/r/#/c/165956/ [18:32:41] andrewbogott: is ok if you're busy, I can poke on monday again, but do merge if you can :) [18:33:00] (03PS3) 10Dzahn: [RFC] add reports.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/165927 [18:33:22] YuviPanda: That patch is merged? [18:33:58] Coren: hmm, not that I can see... https://gerrit.wikimedia.org/r/#/c/165956/ [18:34:19] Coren: it still says 'Can merge: Yes' [18:34:20] Oh, it's +2'ed but not merged. Bad Dzahn [18:34:38] Coren: I think that's because of missing dependencies that have since been merged, so someone would need to hit submit on this one, perhaps [18:34:41] Coren: YuviPanda: some time review sprint for Tim L. ? https://gerrit.wikimedia.org/r/#/q/project:operations/puppet+status:open+owner:%22Tim+Landscheidt+%253Ctim%2540tim-landscheidt.de%253E%22,n,z [18:35:12] YuviPanda: Merged. [18:35:21] (03CR) 10Yuvipanda: [C: 04-1] "We collect metrics for everything now :)" [puppet] - 10https://gerrit.wikimedia.org/r/157620 (owner: 10Tim Landscheidt) [18:35:35] i'd merge this one https://gerrit.wikimedia.org/r/#/c/145573/ [18:35:45] Coren: down the dependency chain? :) [18:36:47] (03CR) 10Gage: [C: 031] "Thanks, I've ran into this too. 
I also had to remove these lines from the same section in order to bring up a node which is both namenode " [puppet] - 10https://gerrit.wikimedia.org/r/164763 (owner: 10QChris) [18:36:49] mutante: didn't we have some vague troubles with the .gz file being gzipped? [18:38:22] Coren: 2 comments about the +2, "if it wouldn't have had the dependency i would have merged, but i dont want to have to merge all of it".. and the other "but i saw others do this all the time .. +2 without submitting" [18:38:48] where it means "i really checked it" as opposed to just "lgtm" [18:39:23] mutante: We need a different convention because there are cases where I'd want to +2 something as "This is Good" but cannot merge for some reason. [18:39:54] It may be all moot eventually anyways because Phab [18:40:43] (03CR) 10coren: [C: 032] "Sane enough." [puppet] - 10https://gerrit.wikimedia.org/r/165957 (owner: 10Yuvipanda) [18:40:57] * YuviPanda wonders if he should work without as many dependencies, but that seems to be more complicated + prone to cause merge conflicts [18:41:01] yea, we should have some more docs on the reviewing process and what we mean and how me merge [18:42:08] also about "should i merge somebody else's change or not (when they also have +2)" [18:42:13] (03CR) 10coren: [C: 032] "Make the previous useful." [puppet] - 10https://gerrit.wikimedia.org/r/165961 (owner: 10Yuvipanda) [18:42:34] i actually like it if others just merge.. i would say if something should not be merged the owner should vote it down [18:43:16] mutante: yeah, I think anything without an explicit -2 (or -1 in ops/puppet case, since I can't -2) or [WIP] is fair game for reviews/merging, but with ops the person merging also has some responsibility of seeing the change through... 
[18:43:28] YuviPanda: i think you should remove the dependencies unless they are actual code dependencies but i didn't want to be that annoying guy who votes it down for that :p [18:43:45] <_joe_> whoever gives +2 is fully responsible for what he/she merges [18:44:12] (03PS2) 10Yuvipanda: shinken: Use appropriate notification command [puppet] - 10https://gerrit.wikimedia.org/r/166053 [18:45:05] (03CR) 10coren: [C: 032] shinken: Use appropriate notification command [puppet] - 10https://gerrit.wikimedia.org/r/166053 (owner: 10Yuvipanda) [18:45:27] _joe_: Yeah, but that leaves us with a "I approve of this and reviewed it carefully" [18:45:36] s/with/without/ [18:45:47] Coren: yay! :) I've it already running on shinken-server-01 :) can you force a run on neon? [18:48:04] YuviPanda: {{done}} [18:48:17] Coren: icinga -v /etc/icinga/icinga.cfg? [18:48:42] I should say, {{doing}} [18:48:53] I've {{done}} starting it, it's still in progress. :_) [18:50:48] Now puppet itself is also {{done}} :-) [18:55:45] (03CR) 10Andrew Bogott: "Proper heira support on labs is a ways off -- would you like me to just merge this in the meantime?" [puppet] - 10https://gerrit.wikimedia.org/r/165770 (owner: 10Reedy) [18:56:27] Coren: cool :) and icinga is fine, I guess? [18:58:53] mutante: do you mind handling another bugzilla ticket? https://rt.wikimedia.org/Ticket/Display.html?id=8628 [18:59:15] It appears to be [18:59:42] andrewbogott: I can rejigger the image for the volume stuff, but are you done with the other issues you had? [18:59:52] andrewbogott: ah, hi, didn't see you on IRC earlier, so i wanted to reply to the last request. 
i did solve that "merge BZ user" RT ticket [18:59:56] Coren: cool, thanks [19:00:11] andrewbogott: yea, taking that too [19:01:10] (03CR) 10Alex Monk: [C: 031] T620: Timezone should not default to America/Los_Angeles [puppet] - 10https://gerrit.wikimedia.org/r/166068 (owner: 10Chad) [19:03:04] PROBLEM - nutcracker port on mw1026 is CRITICAL: Connection refused by host [19:03:14] mutante: thank you! I've been living bouncerless this week, probably need to do something about that. [19:03:14] (03CR) 10Dzahn: [C: 031] T620: Timezone should not default to America/Los_Angeles [puppet] - 10https://gerrit.wikimedia.org/r/166068 (owner: 10Chad) [19:03:23] PROBLEM - nutcracker process on mw1026 is CRITICAL: Connection refused by host [19:03:23] PROBLEM - DPKG on mw1026 is CRITICAL: Connection refused by host [19:03:33] PROBLEM - puppet last run on mw1026 is CRITICAL: Connection refused by host [19:03:34] PROBLEM - Disk space on mw1026 is CRITICAL: Connection refused by host [19:03:37] Coren: yes, I'm happy with the current images, other than having broken lvm [19:03:37] How many +1s does it need? :-) [19:03:49] (03CR) 10Dzahn: [C: 032] T620: Timezone should not default to America/Los_Angeles [puppet] - 10https://gerrit.wikimedia.org/r/166068 (owner: 10Chad) [19:04:12] Yay. 
[19:04:14] PROBLEM - RAID on mw1026 is CRITICAL: Connection refused by host [19:04:34] PROBLEM - check configured eth on mw1026 is CRITICAL: Connection refused by host [19:04:53] PROBLEM - check if dhclient is running on mw1026 is CRITICAL: Connection refused by host [19:05:03] PROBLEM - check if salt-minion is running on mw1026 is CRITICAL: Connection refused by host [19:10:23] RECOVERY - RAID on mw1026 is OK: OK: no RAID installed [19:10:23] RECOVERY - nutcracker process on mw1026 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [19:10:24] RECOVERY - DPKG on mw1026 is OK: All packages OK [19:10:34] RECOVERY - Disk space on mw1026 is OK: DISK OK [19:10:43] RECOVERY - check configured eth on mw1026 is OK: NRPE: Unable to read output [19:10:53] RECOVERY - check if dhclient is running on mw1026 is OK: PROCS OK: 0 processes with command name dhclient [19:11:05] RECOVERY - check if salt-minion is running on mw1026 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [19:11:15] RECOVERY - nutcracker port on mw1026 is OK: TCP OK - 0.000 second response time on port 11212 [19:11:44] PROBLEM - puppet last run on mw1026 is CRITICAL: CRITICAL: Puppet has 1 failures [19:19:33] PROBLEM - Apache HTTP on mw1157 is CRITICAL: Connection timed out [19:19:43] PROBLEM - Apache HTTP on mw1160 is CRITICAL: Connection timed out [19:19:44] PROBLEM - Apache HTTP on mw1158 is CRITICAL: Connection timed out [19:19:45] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:53] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:19:53] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:20:19] uhm.. 
[19:20:23] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.314 second response time [19:20:34] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.080 second response time [19:20:34] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.091 second response time [19:20:43] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.077 second response time [19:20:44] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.085 second response time [19:20:45] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.092 second response time [19:21:15] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [19:22:06] ori: they are imagescalers [19:22:11] ori: ps aux | grep MAGICK [19:22:49] looking, though i'm not very experienced with imagemagick stuff [19:23:54] yea, so already recovered and they are just making thumbnails out of large jpeg's afaict [19:23:58] yeah [19:23:58] http://ganglia.wikimedia.org/latest/stacked.php?m=ap_busy_workers&c=Image%20scalers%20eqiad&r=hour&st=1412969010&host_regex= [19:26:30] (03CR) 10Dzahn: [C: 032] labsdebrepo: Fix initial run [puppet] - 10https://gerrit.wikimedia.org/r/145573 (owner: 10Tim Landscheidt) [19:28:06] (03PS3) 10Hashar: contint: Sikuli is no longer used anywhere [puppet] - 10https://gerrit.wikimedia.org/r/165204 (https://bugzilla.wikimedia.org/54393) (owner: 10Zfilipin) [19:28:42] (03CR) 10Hashar: "Cherry picked on tip of production branch to get rid of the other dependency" [puppet] - 10https://gerrit.wikimedia.org/r/165204 (https://bugzilla.wikimedia.org/54393) (owner: 10Zfilipin) [19:29:25] mutante: you +2ed a change that depended on a change which is not merged yet. 
I have removed the dep and it is good to go now : https://gerrit.wikimedia.org/r/#/c/165204/ :) [19:29:43] (03CR) 10Dzahn: "in this context i'd like to remind again that it would be nice if beta and prod could just use the exact same thing. i know somewhere ther" [puppet] - 10https://gerrit.wikimedia.org/r/164508 (owner: 10Ori.livneh) [19:30:26] hashar: i was looking at it in that second, you read my mind [19:30:35] :p [19:30:39] (03CR) 10Dzahn: [C: 032] contint: Sikuli is no longer used anywhere [puppet] - 10https://gerrit.wikimedia.org/r/165204 (https://bugzilla.wikimedia.org/54393) (owner: 10Zfilipin) [19:30:46] mutante: thanks, will remove the package manually [19:31:02] hashar: i think it's already gone [19:31:11] but good if you check again i was on the right nodes [19:31:25] didnt see it on gallium [19:31:38] and lanthanum [19:31:42] mutante: yeah it is only on the labs instance we use as Jenkins slaves :D [19:32:07] ah, ok [19:33:05] RECOVERY - puppet last run on mw1026 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [19:33:06] (03CR) 10Dzahn: "oh heh, yea, actually https://gerrit.wikimedia.org/r/#/c/160953/ as well" [puppet] - 10https://gerrit.wikimedia.org/r/164508 (owner: 10Ori.livneh) [19:33:14] danke [19:33:53] de rien [19:34:33] mutante: you opened a ticket about getting RT access for Marcel Ruiz Forns, but we don't know his email do we? 
[19:35:10] andrewbogott: should be mforns@ , i just wasnt sure at the time of ticket creation [19:35:25] (i got that from trying google autocomplete) [19:35:33] (03PS21) 10Hashar: contint: Add Xvfb module, role::ci::slave::localbrowser and Chromium [puppet] - 10https://gerrit.wikimedia.org/r/163791 (owner: 10Krinkle) [19:35:34] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [19:35:49] (03CR) 10Hashar: "rebased to drop sikuli package" [puppet] - 10https://gerrit.wikimedia.org/r/163791 (owner: 10Krinkle) [19:36:11] (03CR) 10Hashar: "and applied PS21 on the contint puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/163791 (owner: 10Krinkle) [19:36:41] andrewbogott: he requested it because it's on some onboarding document to request one, it won't be super important but helpful for him to read older tickets until we actually switched [19:36:45] bbiab,..lunch [19:37:09] hashar: rebase and change in one commit :) [19:37:22] (03CR) 10Krinkle: [C: 031] contint: Add Xvfb module, role::ci::slave::localbrowser and Chromium [puppet] - 10https://gerrit.wikimedia.org/r/163791 (owner: 10Krinkle) [19:37:54] hashar: ah, interesting, the stack changed. the other one got in first. [19:37:58] cool [19:39:06] (03CR) 10Hashar: "I have removed the package on contint labs slaves manually using:" [puppet] - 10https://gerrit.wikimedia.org/r/165204 (https://bugzilla.wikimedia.org/54393) (owner: 10Zfilipin) [19:40:40] Krinkle: yeah it was easier to get merged :D [19:41:09] !log mw1026 rebuild complete (now with HHVM goodness in every bite!) [19:41:13] anyway just went to get that one +2ed heading out again [19:41:18] Logged the message, Master [19:43:08] andrewbogott: It's been months since I last fiddled with this; remind me where the partman recipe for the labs images lives? [20:00:27] (03PS1) 10Andrew Bogott: Add deployment access to the tmh boxes. 
[puppet] - 10https://gerrit.wikimedia.org/r/166109 [20:03:53] (03CR) 10Andrew Bogott: "I would've thought that unpuppetized user accounts would be purged from this box. Incorrect?" [puppet] - 10https://gerrit.wikimedia.org/r/166109 (owner: 10Andrew Bogott) [20:25:22] (03CR) 10Rush: [C: 031] "afaik this is ok" [puppet] - 10https://gerrit.wikimedia.org/r/166109 (owner: 10Andrew Bogott) [20:25:45] (03PS2) 10Andrew Bogott: Add deployment access to the tmh boxes. [puppet] - 10https://gerrit.wikimedia.org/r/166109 [20:27:43] (03CR) 10Andrew Bogott: [C: 032] Add deployment access to the tmh boxes. [puppet] - 10https://gerrit.wikimedia.org/r/166109 (owner: 10Andrew Bogott) [20:31:17] (03PS1) 10Andrew Bogott: Fix double-include of 'admin' [puppet] - 10https://gerrit.wikimedia.org/r/166134 [20:31:34] PROBLEM - puppet last run on tmh1001 is CRITICAL: CRITICAL: puppet fail [20:32:27] that looks related [20:32:29] andrewbogott: [20:32:40] mutante: yeah, I'm on it [20:32:53] (03CR) 10Andrew Bogott: [C: 032] Fix double-include of 'admin' [puppet] - 10https://gerrit.wikimedia.org/r/166134 (owner: 10Andrew Bogott) [20:34:10] I have questions about fr-*@wikimedia.org membership. [20:34:18] Is there a place I can go for that information? [20:34:31] (03PS2) 10Dzahn: ganglia_new - remove pmtpa from configuration [puppet] - 10https://gerrit.wikimedia.org/r/164498 [20:34:36] awight: yea, IRC.. hold on :) [20:34:49] hehe ok thx [20:35:47] awight: ok, yea, so what do you need. 
i have the info in front of me but not sure if i should paste it [20:36:33] you are on fr-tech, fr-software-engineers-, fr-online [20:36:43] hehe [20:37:05] mutante: well for now, I'm trying to confirm that ewulczyn_____ has been added to fr-tech and fr-online [20:37:07] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:37:30] awight: i can't confirm that, nope [20:37:33] PROBLEM - Apache HTTP on mw1160 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:37:43] RECOVERY - puppet last run on tmh1001 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [20:37:47] PROBLEM - Apache HTTP on mw1154 is CRITICAL: Connection timed out [20:37:57] PROBLEM - Apache HTTP on mw1153 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:05] PROBLEM - Apache HTTP on mw1158 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:05] PROBLEM - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:05] awight: quick mail to ops-requests@rt.wikimedia.org ? will do it [20:38:06] mutante: erm as in you can't tell me or the answer is NO :) [20:38:12] mutante: ok rad, thank you [20:38:13] PROBLEM - Apache HTTP on mw1156 is CRITICAL: Connection timed out [20:38:15] awight: the answer is no. 
not on it yet [20:38:29] ok good to know [20:38:45] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:38:45] PROBLEM - Apache HTTP on mw1157 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:39:29] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [20:43:04] PROBLEM - very high load average likely xfs on ms-be1013 is CRITICAL: CRITICAL - load average: 107.76, 100.33, 76.61 [20:43:24] RECOVERY - Apache HTTP on mw1158 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.377 second response time [20:43:24] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 3.626 second response time [20:43:33] RECOVERY - LVS HTTP IPv4 on rendering.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 71664 bytes in 0.788 second response time [20:44:03] RECOVERY - Apache HTTP on mw1156 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 1.672 second response time [20:44:14] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.058 second response time [20:44:14] RECOVERY - Apache HTTP on mw1157 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.069 second response time [20:44:24] RECOVERY - Apache HTTP on mw1160 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.078 second response time [20:44:34] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.075 second response time [20:44:44] RECOVERY - Apache HTTP on mw1153 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.081 second response time [20:47:57] (03CR) 10Dzahn: [C: 032] ganglia_new - remove pmtpa from configuration [puppet] - 10https://gerrit.wikimedia.org/r/164498 (owner: 10Dzahn) [20:51:50] bd808, had a qn. about logstash. 
[20:52:05] hashar: I know I keep asking you this, but… ok if I delete the instances in the labs 'ganglia' project? There are three there -- two you created, one I did. [20:52:16] subbu: I'll try to answer. What's up? [20:52:37] so, we are trying to hook up with parsoid with logstash (via gelf). [20:52:53] the qn. is if the hostnames are public (and hence can be in our deploy repo) or should it go via puppet? [20:53:01] https://gerrit.wikimedia.org/r/#/c/166137/1/conf/wmf/localsettings.js for ex. [20:53:40] also, is there anything to do in terms of enabling this for parsoid .. or do we just send packets to that server and everything will magically work from that point on? [20:53:47] s/or// [20:54:20] ah. No problem putting them in your config. they are in the puppet manifest. I would suggest for now that parsoid send messages to logstash1003 [20:54:51] * bd808 looks to see what changes we made for pdf logs [20:54:54] RECOVERY - very high load average likely xfs on ms-be1013 is OK: OK - load average: 44.79, 74.49, 80.00 [20:55:49] cscott, fyi. ^ [20:56:03] subbu: I think you can start by just sending in messages, but you may want to add config to https://github.com/wikimedia/operations-puppet/blob/production/files/logstash/filter-gelf.conf to make your messages prettier/more useful to you. [20:56:28] ok .. thanks. looking. [20:57:03] cool. [20:57:05] andrewbogott: yes. :) [20:57:22] hashar: great, I may actually do it this time! [20:57:28] andrewbogott: I believe ganglia.wmflabs.org has been down for long enough that we can get rid of it [20:57:30] subbu: You can try things out by sending messages into beta's logstash and see how they look. I think mwalker ended up tweaking the config in his app after he saw how the events were recorded. [20:57:38] andrewbogott: and YuviPanda as an OK replacement going on. [20:57:51] ah, ok. what is the beta logstash server name? 
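(A rough sketch of what "sending messages via gelf" amounts to, for the parsoid/logstash question above. It only builds the payload and does not open a socket; the host name, message text and the build_gelf_message / frame_for_tcp helpers are made up for illustration and are not what parsoid or gelf-stream actually do. It assumes GELF 1.1, where extra fields carry a leading underscore and "_id" is not allowed.)

```python
import json
import time


def build_gelf_message(host, short_message, level=6, **extra):
    """Return a GELF 1.1 payload as UTF-8 encoded JSON bytes.

    `level` follows syslog severities (6 = informational).
    Additional fields are prefixed with an underscore, as GELF requires.
    """
    message = {
        "version": "1.1",
        "host": host,
        "short_message": short_message,
        "timestamp": time.time(),
        "level": level,
    }
    for key, value in extra.items():
        if key == "id":
            # "_id" is the one additional field name GELF forbids.
            raise ValueError("GELF does not allow an additional field named '_id'")
        message["_" + key] = value
    return json.dumps(message).encode("utf-8")


def frame_for_tcp(payload):
    """GELF over TCP delimits messages with a trailing null byte."""
    return payload + b"\x00"


# Hypothetical values, purely for illustration.
payload = build_gelf_message(
    "parsoid-dev.example", "hello from a dev instance", service="parsoid"
)
decoded = json.loads(payload.decode("utf-8"))
print(decoded["short_message"], decoded["_service"])
```

(Over UDP the same JSON would go out as a datagram instead of being null-terminated, which is one reason messages can go missing without anyone noticing.)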
[20:58:20] andrewbogott: Timo even wrote some frontend this week https://tools.wmflabs.org/nagf/?project=integration [20:58:31] so is gelf config needed since we control our output and massage it to appear how we want? [20:58:34] subbu: GUI is https://logstash-beta.wmflabs.org/ and the host is deployment-logstash1.eqiad.wmflabs [20:59:00] * andrewbogott deletes everything [20:59:26] bd808: what's the transport used by logstash? i've noticed missing log messages for ocg. is that a bad configuration somewhere, or just part of how logstash works? [20:59:26] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [20:59:37] hashar: is the new baby really sleeping so well that you can work at 10 at night while holding her in one arm? [20:59:39] bd808: when did you migrate the login off officewiki? :p [20:59:50] subbu: probably gelf config isn't needed, although looking at that config file it looks like there might be some field names we want to avoid [21:00:00] JohnLewis: When I was trying to get volunteers with NDAs in [21:00:14] andrewbogott: I am holding her crying right know :p she is hungry [21:00:24] :( [21:00:33] ya. i assume since hadoop users don't control hadoop output, the message is being tweaked. [21:00:41] hashar, so each time I type your name in IRC it causes a klaxxon to sound that wakes up your children? [21:00:59] cscott, but, yes, sending to beta and tweaking it from there is a good way to play with this. [21:01:01] bd808: kk - I was considering poking a staffer to grab details for me but never did :) [21:01:52] cscott: The gelf input is http://logstash.net/docs/1.4.2/inputs/gelf I think it accepts both tcp and udp [21:01:56] bd808, so, another qn. can i point my laptop dev. instance to beta-labs logstash? [21:02:20] (03CR) 10Andrew Bogott: "I just deleted ganglia.wmflabs.org. So there's that." 
[puppet] - 10https://gerrit.wikimedia.org/r/165360 (owner: 10Yuvipanda) [21:02:29] maybe ocg is configured with udp, i should look hard at it. [21:02:39] subbu: Hmmm... probably not, unless you make an ssh tunnel into labs [21:02:44] subbu: and you probably want to double check that gelf-stream is using tcp [21:02:48] k [21:03:01] bd808, subbu: ssh tunnels work well for me [21:03:01] subbu: But I've "been meaning to" make a logstash role for mw-vagrant [21:03:05] (03CR) 10Andrew Bogott: [C: 031] "I like this, but I also like the idea of actively purging ganglia from instances..." [puppet] - 10https://gerrit.wikimedia.org/r/165360 (owner: 10Yuvipanda) [21:03:27] (03PS1) 10coren: Labs: Make images create LVM at firstboot.sh [puppet] - 10https://gerrit.wikimedia.org/r/166139 [21:03:32] i use `ssh -L 17080:ocg.svc.eqiad.wmnet:8000 tin` for instance. [21:03:44] andrewbogott: ^^ Sorry it took longer than I hoped, I decided to write it extra paranoid. [21:03:52] cscott: Be careful that logging doesn't become a bottleneck if you are trying to ensure that no messages are lost. [21:03:55] * andrewbogott reads [21:04:14] PROBLEM - puppet last run on netmon1001 is CRITICAL: CRITICAL: puppet fail [21:04:27] I personally consider app level logging to be valuable but not more valuable than perfomance [21:05:43] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: puppet fail [21:06:43] andrewbogott: Oh, hang on, we probably want to add a swapon -a if we created a swap. [21:07:16] (03PS2) 10coren: Labs: Make images create LVM at firstboot.sh [puppet] - 10https://gerrit.wikimedia.org/r/166139 [21:08:41] Coren: does this move /root into the control of lvm as well, or does it just create a big lvm partition that occupies the rest of the unpartitioned space? [21:09:24] The latter; everything but /. It would have been _possible_ to do it, but in my book anything that requires pivot_root() is evil. 
:-) [21:10:24] Oh, wait, /root not the root (/) :-) [21:10:37] No, /root is in / and not /var so it'll never be touched by this. [21:11:03] ok, that's what I thought, was just wondering because of 'all our disk are belong to LVM' [21:11:19] Coren: do you mind adding a couple more comments explaining the cavalcade of sed up top? [21:12:15] # This could probably be done with awk, but (to my shame) I never really got used to it. :-) [21:12:22] Sure. [21:12:59] If you'd used regexp I'd probably be even more confused -- no complaints about sed per se :) [21:13:09] echo "foo bar baz" | awk '{print $1 " " $3 }' [21:13:11] foo baz [21:13:18] that is all I know about awk :-( [21:15:10] (03PS3) 10coren: Labs: Make images create LVM at firstboot.sh [puppet] - 10https://gerrit.wikimedia.org/r/166139 (https://bugzilla.wikimedia.org/71873) [21:16:16] ... actually, I can probably simplify that part now that I realize I'm doing the same thing twice [21:17:15] (03PS4) 10coren: Labs: Make images create LVM at firstboot.sh [puppet] - 10https://gerrit.wikimedia.org/r/166139 (https://bugzilla.wikimedia.org/71873) [21:17:26] andrewbogott: ^^ this one is MUCH simpler [21:19:19] Oh, duh! Also completely wrong! Ignore it. [21:19:25] ok! [21:19:39] * andrewbogott turns laptop so the screen is facing the other way [21:19:50] I just found parted can be made to output a machine-readable format ( : separated ) [21:20:09] sudo parted -m -s /dev/vda print [21:20:26] yields fields like: 3:10.2GB:12.3GB:2048MB:linux-swap(v1)::; [21:21:31] hashar: OoooO! I did not know that. MUCH more reliable. [21:21:39] * Coren adapts. 
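(A small sketch of why the `parted -m` output is easier to handle than a chain of sed and cut: each partition is one colon-separated record ending in a semicolon. This is not the firstboot.sh code from the patch; parse_parted_line and the field names are made up for illustration, and the device line in the sample is an assumed example. Only the partition record is taken from the log.)

```python
def parse_parted_line(line):
    """Parse one partition record from `parted -m ... print`.

    Partition records look like number:start:end:size:fs:name:flags;
    Returns a dict, or None for lines that are not partition records
    (for example the "BYT;" header or the device summary line).
    """
    fields = line.strip().rstrip(";").split(":")
    if len(fields) != 7 or not fields[0].isdigit():
        return None
    number, start, end, size, filesystem, name, flags = fields
    return {
        "number": int(number),
        "start": start,
        "end": end,
        "size": size,
        "filesystem": filesystem,
        "name": name,
        "flags": flags,
    }


# The last record is the one quoted in the channel; the first two lines
# are an assumed header and device line for illustration.
sample_output = """BYT;
/dev/vda:21.5GB:virtblk:512:512:msdos:Virtio Block Device;
3:10.2GB:12.3GB:2048MB:linux-swap(v1)::;
"""

partitions = [
    parsed
    for parsed in map(parse_parted_line, sample_output.splitlines())
    if parsed is not None
]
for partition in partitions:
    print(partition["number"], partition["end"], partition["filesystem"])
```

(The end offset of the last partition is the piece a firstboot script would need to know where unpartitioned space begins, though how the actual patch uses it is not shown in the log.)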
[21:22:02] (03CR) 10Hashar: "Have a look at : parted -m" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/166139 (https://bugzilla.wikimedia.org/71873) (owner: 10coren) [21:22:09] Coren: I have pasted an example :) [21:22:40] when I filed the bug yesterday I was wondering what all those sed / cut were, looked at it and ended up with some numbers I had no idea whether they were fine :/ [21:25:05] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [21:26:08] Gah, I need to find an instance with unpartitioned space to test. [21:26:38] Coren: will your testing destroy that instance? [21:26:54] andrewbogott: No; I'm just testing that I get the right fields off parted [21:27:06] Probably wikitech-test-horizon has a bit of unpartitioned space. [21:27:08] For example [21:27:42] It does, excellent. [21:27:44] Thanks. [21:28:14] I am quite happy about that parted machine readable option I just found [21:28:20] we should get a --json as well :D [21:30:23] (03CR) 10Chad: [C: 032 V: 032] Update hooks-bugzilla to 6e1e659eedc8719a2a0ea0906266738a18c7aa42 [gerrit/plugins] - 10https://gerrit.wikimedia.org/r/164879 (owner: 10QChris) [21:32:33] (03PS5) 10coren: Labs: Make images create LVM at firstboot.sh [puppet] - 10https://gerrit.wikimedia.org/r/166139 (https://bugzilla.wikimedia.org/71873) [21:32:40] andrewbogott: ^^ cleaner, and more reliable [21:35:02] (03CR) 10Andrew Bogott: [C: 032] "Let's see what it does!" [puppet] - 10https://gerrit.wikimedia.org/r/166139 (https://bugzilla.wikimedia.org/71873) (owner: 10coren) [21:37:48] FYI, this has the desirable side effect that we can just fiddle the firstboot.sh in an instance to partition differently; no need to rebuild an image. [21:41:56] Coren: new images are building… I'm going to run an errand, back in 30. [21:42:29] Coren: will we have a way to migrate existing instances to the new partition scheme? 
[21:42:44] Coren: the beta cluster instances might benefit from it (no idea really) [21:43:05] <^d> Oooh, does this fix our "Error: Can't create any more partitions." on trusty images? [21:43:28] * ^d just caught scrollback [21:44:25] ^d: It should, once an image is built with it. [21:44:36] ^d: But also it makes /var and /var/log resizeable [21:44:55] <^d> Ah, hmm [21:48:53] ^d: That said, andrewbogott is currently building an image, I think. [21:49:14] <^d> Hmm, ok. [21:49:38] <^d> If /var's resizable, does it mean setups like https://phabricator.wikimedia.org/P8 we have on precise won't be necessary? [21:52:31] when we migrated to eqiad, we ended up with a puppet class to allocate everything to /srv/ [21:52:36] Well, it's probably wise to keep a separate partition for application data that grows rather than keep it all in /var; you can already do that with LVM today. But yeah, you /could/ resize /var if you needed to [21:52:38] using the neat role::labs::lvm::srv [21:52:57] but that causes a bunch of issues since we have to symlink from /var//lib/whatever to a dir in /srv/whatever [21:52:58] not idea [21:53:00] And yes, /srv is a better location. I don't get why Ubuntu didn't already correct that. [21:54:14] The /best/ solution is to have per-application volumes mounted in /srv, and configure the application to point it there - this offers maximum reliability [21:54:33] It also means changing stuff. Inertia is hard. :-) [21:54:52] Or we could always go /a :-) [21:56:22] at previous job we did that [21:56:30] /srv/application1/ [21:56:57] and we had a bunch of partitions for that app. var log db tmp software [21:57:07] (using french names of courses hehe) [21:57:18] hashar: It's a bit troublesome to set up, but by far the most reliable setup. [21:57:33] even the batch scripts names had to follow a normalization [21:57:49] so you would run /srv/application1/software/v1.0/batch/001.sh [21:57:52] crazy [21:58:22] Personally, I think that goes a bit overboard. 
A partition for each of /srv/whatever I can see, /srv/wathever/log is defensible. [21:58:46] (03PS1) 10Bene: Add "recommended article" and "featured list" badge [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166144 (https://bugzilla.wikimedia.org/70268) [21:58:57] temp and persistent storage might as well [21:59:03] but yeah, that is probably overkill [21:59:17] the idea was that applications ended up more or less isolated [21:59:24] eventually they moved to Solaris 10 and containers [21:59:34] (more or less like LXC on linux as I understood it) [22:00:02] Yeah, I did a lot of Solaris in the past. It's kinda-sorta. [22:00:12] Only thing I really miss from Solaris is zfs [22:00:48] cause it is has the equivalent of LVM build in isn't it ? [22:01:30] (03CR) 10Bene: "Note that the badges will be displayed with the default blue icon if we merge this right away. If we want other icons we have to add the r" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/166144 (https://bugzilla.wikimedia.org/70268) (owner: 10Bene) [22:01:42] It's pretty much a well integrated filesystem, lvm, and software raid-ish. There are downsides to putting all the eggs in one basket, but nobody can deny the some the niceties of every bit understanding every other. [22:02:17] Also, the actual filesystem itself was pretty nice and robust. [22:02:50] poor Sun. RIP [22:03:15] anyway it is late there. Thanks Coren to have worked on the labs partition mess :] [22:03:24] (03PS3) 10Dzahn: delete contacts.wikimedia.org SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/164695 [22:03:26] Yeah, memories of Digital. Another great engineering victim to commercial greed. [22:03:31] hashar: Bonne nuit! [22:03:39] Merci et bon week-end! [22:04:57] (03PS3) 10Dzahn: remove virt-star.pmtpa SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/164696 [22:06:31] (03PS3) 10Dzahn: delete nfs[12].pmtpa SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/164694 [22:09:32] bd808, one more thing i forgot .. 
what port (for the beta labs logstash)? [22:10:28] subbu: 12201 -- https://github.com/wikimedia/operations-puppet/blob/production/manifests/role/logstash.pp#L70-L72 [22:12:05] thanks. [22:14:48] (03PS4) 10Dzahn: delete nfs[12].pmtpa SSL certs [puppet] - 10https://gerrit.wikimedia.org/r/164694 [22:15:27] (03CR) 10Dzahn: [C: 032] "keys deleted in private repo as well" [puppet] - 10https://gerrit.wikimedia.org/r/164694 (owner: 10Dzahn) [22:27:02] mutante: sorry :) just saw the description and the time confused me :p [22:27:42] JohnLewis: no worries, thanks for volunteering. i just found that one of them is still installed by puppet f.e. [22:28:23] Ah [22:29:50] (03PS1) 10Dzahn: contacts.wm - remove ferm/443, remove SSL cert [puppet] - 10https://gerrit.wikimedia.org/r/166147 [22:30:32] JohnLewis: ^ that gerrit user is you, right [22:30:36] the one i added [22:31:00] mutante: well I got three emails on just prefixed [Gerrit] so I assume you got the right one :) [22:31:37] heh, ok [22:32:14] I assume you want a review :p [22:32:37] it was something between that and telling you why i took the ticket back :) [22:33:01] well eitherway [22:33:04] (03CR) 10John F. Lewis: [C: 031] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/166147 (owner: 10Dzahn) [22:33:49] (03CR) 10Dzahn: [C: 032] "contacts.wikimedia.org is an alias for misc-web-lb.eqiad.wikimedia.org." [puppet] - 10https://gerrit.wikimedia.org/r/166147 (owner: 10Dzahn) [22:34:54] cscott_away: ah, missed you. wanted to ask if the Bugzilla merge is all good [22:35:10] user merge, not code merge that is , heh [22:36:12] (03CR) 10Dzahn: "Filebucketed /etc/ferm/conf.d/10_contacts_https" [puppet] - 10https://gerrit.wikimedia.org/r/166147 (owner: 10Dzahn) [22:42:47] (03CR) 10Dzahn: "how do file deletions like this require rebasing?" 
[puppet] - https://gerrit.wikimedia.org/r/164695 (owner: Dzahn)
[22:42:59] (PS4) Dzahn: delete contacts.wikimedia.org SSL cert [puppet] - https://gerrit.wikimedia.org/r/164695
[22:43:33] (CR) Dzahn: [C: 2] "deleted keys in private repo. shred'ed key on zirconium and deleted certs. removed from puppet role" [puppet] - https://gerrit.wikimedia.org/r/164695 (owner: Dzahn)
[23:08:08] ori: yooo, how's that thar kafkatee?
[23:21:59] (PS1) Chad: Add my default .ackrc [puppet] - https://gerrit.wikimedia.org/r/166151
[23:24:01] (PS2) Chad: Add my default .ackrc [puppet] - https://gerrit.wikimedia.org/r/166151
[23:25:17] (CR) Dzahn: [C: 2] "Robh, follow-up ticket created, 8631. deleted key from private repo. shred'ed key and cert from terbium. Apache config is already gone." [puppet] - https://gerrit.wikimedia.org/r/164001 (owner: Dzahn)
[23:27:34] (CR) Dzahn: [C: 2] Add my default .ackrc [puppet] - https://gerrit.wikimedia.org/r/166151 (owner: Chad)
[23:28:14] <^d> mutante: thx :)
[23:29:07] yw
[23:29:15] (PS1) Chad: Correct syntax helps [puppet] - https://gerrit.wikimedia.org/r/166153
[23:29:19] <^d> Too bad I screwed up :p
[23:29:36] <^d> I dunno how I failed so bad copying that from my localhost.
[23:30:30] (PS2) Dzahn: home/demon/.ackrc - fix syntax [puppet] - https://gerrit.wikimedia.org/r/166153 (owner: Chad)
[23:30:39] (CR) Dzahn: [C: 2] home/demon/.ackrc - fix syntax [puppet] - https://gerrit.wikimedia.org/r/166153 (owner: Chad)
[23:31:16] (CR) Dzahn: [V: 2] home/demon/.ackrc - fix syntax [puppet] - https://gerrit.wikimedia.org/r/166153 (owner: Chad)
[23:37:44] (PS4) Dzahn: [RFC] add reports.wikimediafoundation.org [dns] - https://gerrit.wikimedia.org/r/165927
[23:37:54] (CR) jenkins-bot: [V: -1] [RFC] add reports.wikimediafoundation.org [dns] - https://gerrit.wikimedia.org/r/165927 (owner: Dzahn)
[23:38:37] (PS5) Dzahn: [RFC] add reports.wikimediafoundation.org [dns] - https://gerrit.wikimedia.org/r/165927
[23:38:45] (CR) jenkins-bot: [V: -1] [RFC] add reports.wikimediafoundation.org [dns] - https://gerrit.wikimedia.org/r/165927 (owner: Dzahn)
[23:39:45] (CR) Dzahn: "heh, what.." [dns] - https://gerrit.wikimedia.org/r/165927 (owner: Dzahn)
[23:41:41] (CR) Dzahn: [C: -2] [RFC] add reports.wikimediafoundation.org [dns] - https://gerrit.wikimedia.org/r/165927 (owner: Dzahn)
[23:42:31] (CR) Dzahn: "how exactly does get wikipediazero in here?" [dns] - https://gerrit.wikimedia.org/r/165927 (owner: Dzahn)
[23:45:38] (CR) Dzahn: [C: 1] "bumping up" [dns] - https://gerrit.wikimedia.org/r/138568 (owner: Filippo Giunchedi)