[00:00:04] RoanKattouw, ^d, marktraceur, MaxSem, James_F: Dear anthropoid, the time has come. Please deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141105T0000). [00:00:17] <^d|voted> I've got it, unless someone has a burning desire. [00:00:54] (03CR) 10Chad: [C: 032] Adjust number of content shards for largest wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170173 (owner: 10Chad) [00:02:02] (03Merged) 10jenkins-bot: Adjust number of content shards for largest wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170173 (owner: 10Chad) [00:02:27] (03CR) 10Chad: [C: 032] Adjust number of replicas for de/enwiki content and commons file index [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170182 (owner: 10Chad) [00:03:11] (03Merged) 10jenkins-bot: Adjust number of replicas for de/enwiki content and commons file index [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170182 (owner: 10Chad) [00:03:55] !log demon Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 07s) [00:03:59] Logged the message, Master [00:04:05] !log demon Synchronized wmf-config/CirrusSearch-common.php: (no message) (duration: 00m 04s) [00:04:10] Logged the message, Master [00:06:50] bblack: agreed on TAI, but depending on the whims of ITA we might actually get to have our cake & eat it too [00:08:32] (03PS2) 10Dzahn: elasticsearch: only show comment when section exists [puppet] - 10https://gerrit.wikimedia.org/r/169703 (owner: 10Chad) [00:08:34] ^d|voted: Still good to go, sorry, IRC might have swallowed my original? [00:08:44] <^d|voted> Must've. Ok, I'll start merging yours. [00:09:12] (03CR) 10Dzahn: [C: 032] elasticsearch: only show comment when section exists [puppet] - 10https://gerrit.wikimedia.org/r/169703 (owner: 10Chad) [00:09:21] * James_F crosses his fingers. [00:11:44] <^d|voted> blargh, should've waited for mine to finish, now they're all chained in the same queue :( [00:11:52] Yay jenkins. [00:11:54] * ^d|voted goes and grabs a bottle of water, grumbles about testing [00:19:36] <^d|voted> James_F: Ugh :( https://integration.wikimedia.org/zuul/ [00:20:22] :-( [00:20:42] <^d|voted> I'm going to go ahead and merge to the deploy branches and skip jenkins. [00:20:47] <^d|voted> I won't tell anyone if you don't ;-) [00:22:04] * James_F coughs. 
:-) [00:22:12] They've already passed V+2 [00:22:33] !log demon Synchronized php-1.25wmf5/extensions/CirrusSearch/maintenance/updateOneSearchIndexConfig.php: (no message) (duration: 00m 04s) [00:22:39] (03PS2) 10Ori.livneh: memcached: tidy [puppet] - 10https://gerrit.wikimedia.org/r/171153 [00:22:40] Logged the message, Master [00:22:53] !log demon Synchronized php-1.25wmf6/extensions/CirrusSearch/maintenance/updateOneSearchIndexConfig.php: (no message) (duration: 00m 04s) [00:22:59] Logged the message, Master [00:23:31] !log demon Synchronized php-1.25wmf5/includes/parser/CoreTagHooks.php: (no message) (duration: 00m 05s) [00:23:35] Logged the message, Master [00:23:41] !log demon Synchronized php-1.25wmf5/includes/parser/Parser.php: (no message) (duration: 00m 04s) [00:23:46] Logged the message, Master [00:23:54] !log demon Synchronized php-1.25wmf6/includes/parser/CoreTagHooks.php: (no message) (duration: 00m 04s) [00:24:00] Logged the message, Master [00:24:14] !log demon Synchronized php-1.25wmf6/includes/parser/Parser.php: (no message) (duration: 00m 04s) [00:24:19] Logged the message, Master [00:24:20] <^d|voted> James_F: We're both all live :) [00:25:17] * James_F tests. [00:25:41] ^d|voted: Nothing seems on fire. Success? [00:25:55] <^d|voted> We win the prize today :D [00:27:06] Yay. [00:27:22] (03Abandoned) 10Yurik: Enable ZeroPortal lua extensions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/162855 (owner: 10Yurik) [00:32:55] (03CR) 10Dzahn: mediawiki: tidy /tmp (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/168999 (owner: 10Ori.livneh) [00:35:32] (03CR) 10Dzahn: "sounds good, but maybe we should manually empty it before this is merged? i dunno, but is this a problem when it says "This resource type " [puppet] - 10https://gerrit.wikimedia.org/r/168999 (owner: 10Ori.livneh) [00:38:33] James_F: ^d|voted: thanks for the deployment! [00:38:37] <^d|voted> yw [00:39:57] (03CR) 10Dzahn: "Mark, what do you think, take this or was this one you meant when you said some places should be replaced with codfw" [puppet] - 10https://gerrit.wikimedia.org/r/164273 (owner: 10Hoo man) [00:45:38] PROBLEM - HHVM processes on mw1017 is CRITICAL: PROCS CRITICAL: 0 processes with command name hhvm [00:45:42] <^d|voted> !log elasticsearch: rebuilding all cirrus indexes for all wikis from a screen on terbium, going to take awhile. should be boring, but if causing problems kill it first and then find me. [00:45:52] Logged the message, Master [00:46:29] PROBLEM - DPKG on mw1017 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [00:46:38] RECOVERY - HHVM processes on mw1017 is OK: PROCS OK: 1 process with command name hhvm [00:51:06] <^d|voted> I can't ping elastic1022. [00:52:13] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 368 seconds [00:52:34] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 379 seconds [00:53:44] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds [00:54:02] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -0 seconds [00:54:52] RECOVERY - DPKG on mw1017 is OK: All packages OK [00:56:26] <^d|voted> mutante: Could you maybe give elastic1022 a kick to the rear? [00:57:07] ^d|voted: ok, i would like to trade for someboy kicking deployment-cache-mobile03 [00:57:44] <^d|voted> Hmm [00:58:38] ^d|voted: it went down a couple hours ago in Icinga, but on mgmt i see login [00:58:53] do we know anything else? like has it been worked on? 
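A note on the elastic1022 troubleshooting that follows: once the box turns out to be unreachable, ^d bans it "from allocation" (see [01:07:36] and the !log at [01:12:05] below). Elasticsearch 1.x exposes that through shard-allocation filtering in the cluster settings API; the sketch below assumes the exclude-by-IP form was used, with a placeholder address standing in for elastic1022:

    # Tell the cluster to allocate no shards to (and move shards off) the given node.
    # 10.64.x.x is a placeholder for elastic1022's real IP, which isn't in the log.
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": { "cluster.routing.allocation.exclude._ip": "10.64.x.x" }
    }'

    # Reverting later is the same call with an empty exclude list ("").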
[00:59:09] <^d|voted> Lemme check SAL, but 1022 is one of the new ones. [00:59:15] <^d|voted> Shouldn't have had anything since last week. [00:59:46] <^d|voted> Got nothing in SAL [01:00:29] <^d|voted> Yeah, still nothing on ping or ssh. [01:01:02] PROBLEM - puppet last run on mw1017 is CRITICAL: CRITICAL: Puppet has 1 failures [01:01:20] down since 5h 24m, normal pass doesn't work, also not the new_install key [01:01:56] nothing in RT, powercycling now [01:02:13] !log powercycling elastic1022 [01:02:20] Logged the message, Master [01:02:21] i'll look at mw1017 [01:02:44] 'k, cool [01:04:06] error: diskfilter writes are not supported. [01:04:06] Press any key to continue... [01:04:12] PROBLEM - Apache HTTP on mw1159 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:04:15] hmm..but then it continues [01:04:25] <^d|voted> I'm not sure what I'm supposed to be seeing on deployment-cache-mobile03. [01:04:26] * Starting Elasticsearch Server [ OK ] [01:05:02] PROBLEM - Apache HTTP on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:05:03] RECOVERY - Apache HTTP on mw1159 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.083 second response time [01:05:13] <^d|voted> I tried restarting nginx and varnish, but both logs seem empty-ish. [01:05:41] ^d|voted: hmmm.. i said it because of https://bugzilla.wikimedia.org/show_bug.cgi?id=72997 [01:05:47] http://en.m.wikipedia.beta.wmflabs.org/w/index.php?title=Special:UserLogin [01:06:00] " via deployment-cache-mobile03 deployment-cache-mobile03" [01:06:01] RECOVERY - Apache HTTP on mw1155 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.071 second response time [01:06:19] <^d|voted> fcgi errors in /data/project/logs/apache.log [01:06:28] and elastic1022, the server comes up but no ssh [01:07:00] <^d|voted> s/logs/syslog/ [01:07:03] !log upgrading HHVM app servers to 3.3.0+dfsg1-1+wm2 [01:07:14] Logged the message, Master [01:07:29] mutante, ^d|voted: mobile front end code is broken -- https://bugzilla.wikimedia.org/show_bug.cgi?id=72997 [01:07:31] ^d|voted: bd808 commenting on the issue in labs [01:07:36] <^d|voted> mutante: Hmmm, weird. Ok, don't worry about it tonight. I'll ban the IP from allocation in case it starts flapping or something. [01:07:48] ^d|voted: i cant login on it .. hmm [01:07:54] ^d|voted: ok [01:08:09] bd808: gotcha! thanks [01:08:29] ori: What magic do we need to setup in labs to get the hhvm error logs into logstash? [01:08:45] bd808: how do you get any error log into logstash? [01:08:57] from log2udp [01:09:06] on deployment-bastion [01:09:18] but the hhvm logs don't seem to go there in beta [01:09:39] ^d|voted: i got on it .. with the _old_ root pass [01:09:48] <^d|voted> weirddd. [01:09:54] yes, because you said new ones [01:10:23] ok, so the network cable is disconnected or so [01:10:28] 2: eth0: i think we should ask dc-tech [01:10:40] <^d|voted> Hehehe, that'd explain why we can't reach it :p [01:10:43] yes [01:11:17] bd808: not sure why. if you can spare the time to take a look, that'd be awesome; if you can't, file a bug and assign it to me. [01:11:19] !log elastic1022 - eth0: Logged the message, Master [01:11:36] ori: We don't seem to have any apache2 logs in beta since 2014-07-23. I'm guessing that means something changed in the config for syslog [01:11:53] bd808: are you sure? check /var/log/apache2/other_vhosts_something [01:12:02] ^d|voted: making RT [01:12:05] <^d|voted> !log elastic1022: banned from allocation since its unreachable.
just in case it starts flapping. [01:12:11] Logged the message, Master [01:12:18] <^d|voted> mutante: ty! [01:12:31] PROBLEM - Apache HTTP on mw1154 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:12:49] what's up with mw1154? [01:12:57] ori: Looking in /data/project/logs where all the other logs are written. Our version of fluorine [01:13:22] RECOVERY - Apache HTTP on mw1154 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 400 bytes in 0.076 second response time [01:13:48] imagescaler that was busy resizing stuff,, imagemagick [01:13:52] and is now done.. i think [01:14:00] re 1154 [01:15:13] mutante: thanks [01:17:52] RECOVERY - puppet last run on mw1017 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [01:18:49] ACKNOWLEDGEMENT - Host elastic1022 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn RT #8811 and has been banned from allocation [01:21:39] !log ori Synchronized php-1.25wmf6/extensions/MobileFrontend: Ic26f56c0d: Update MobileFrontend for cherry-picks (duration: 00m 05s) [01:21:43] !log ori Synchronized php-1.25wmf5/extensions/MobileFrontend: Ic82ba72b98: Update MobileFrontend for cherry-picks (duration: 00m 04s) [01:21:50] Logged the message, Master [01:21:57] Logged the message, Master [01:33:24] ori: http://en.wikipedia.beta.wmflabs.org/wiki/User:Jdforrester_(WMF)/Sandbox?debug=true also 503s – same issue? [01:33:46] Oh, now non-debug=true also 503s. I assume HHVM is just restarting. [01:33:51] * James_F waits. [01:33:52] doesn't 503 for me [01:34:13] Still 503ing here. [01:34:19] Aha, now it doesn't. [01:34:27] Thanks. [01:42:20] ori: I found an existing bug for the beta logging problem and added some data about when the logs stopped flowing -- https://bugzilla.wikimedia.org/show_bug.cgi?id=72275#c5 [01:43:09] ori: Reedy is already assigned to the bug some maybe you could give hi some tips about where to look for broken bits in the rsyslog transit stream [01:43:16] *so maybe [01:43:21] him * [01:43:54] kk [01:43:55] Carmela: Thanks. When my typing fails me it fails badly. [01:44:19] Fail fast, as they say! [02:23:00] (03PS1) 10Ori.livneh: $wgPercentHHVM: 5 => 10, to test https://phabricator.wikimedia.org/T820#18870 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171191 [02:24:17] (03CR) 10Ori.livneh: [C: 032] $wgPercentHHVM: 5 => 10, to test https://phabricator.wikimedia.org/T820#18870 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171191 (owner: 10Ori.livneh) [02:24:25] (03Merged) 10jenkins-bot: $wgPercentHHVM: 5 => 10, to test https://phabricator.wikimedia.org/T820#18870 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171191 (owner: 10Ori.livneh) [02:25:28] !log ori Synchronized wmf-config/CommonSettings.php: If866e9caf: $wgPercentHHVM: 5 => 10, to test https://phabricator.wikimedia.org/T820#18870 (duration: 00m 04s) [02:25:37] Logged the message, Master [02:27:29] (03PS1) 10Dzahn: toollabs: install p7zip-full on exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/171192 [02:29:43] (03CR) 10Dzahn: "dev_environ.pp: 'p7zip-full', # requested by Betacommand to extract files using 7zip" [puppet] - 10https://gerrit.wikimedia.org/r/171192 (owner: 10Dzahn) [02:47:51] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2014: active_shards: 6037: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [02:48:51] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2015: active_shards: 6040: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [03:02:54] PROBLEM - ElasticSearch health check on elastic1021 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2014: active_shards: 6037: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [03:02:54] PROBLEM - ElasticSearch health check on elastic1030 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2014: active_shards: 6037: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [03:04:01] RECOVERY - ElasticSearch health check on elastic1021 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2015: active_shards: 6040: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [03:04:06] RECOVERY - ElasticSearch health check on elastic1030 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2015: active_shards: 6040: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [03:09:16] (03PS1) 10Dzahn: add monitoring for search.wm (Apple dict bridge) [puppet] - 10https://gerrit.wikimedia.org/r/171193 [03:11:30] <^d|voted> ew :\ [03:14:33] (03CR) 10Chad: "One inline nit, otherwise ok." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/171193 (owner: 10Dzahn) [03:14:41] heh :p you want to kill the whole thing, dont you [03:14:47] <^d|voted> I do. [03:14:59] people will always wonder if search.wm.org is elastic [03:15:03] <^d|voted> Newer OSXs don't use the old API anymore. I want some logging to see how often it actually gets hit. [03:15:22] ah, i didnt know that part about newer OSXs [03:15:35] good,yea [03:15:43] <^d|voted> Yeah, at least 10.9 and 10.10 just make requests to the normal MW api so Apple's improved it. [03:15:49] <^d|voted> I'm not sure when the change happened though. [03:15:57] you know, i just picked "lucene" because it existed [03:16:03] and i had to put it somewhere [03:16:20] <^d|voted> Do we have a group for api things? [03:16:22] avoiding some icinga error saying "meh, can't find service group, dieing" [03:16:34] checking [03:17:17] well, we have the API servers, like "api_appserver_eqiad" [03:17:35] and we have a "host" api.svc.eqiad.wmnet in group LVS [03:17:37] hrmm [03:18:11] <^d|voted> I'm looking at https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?servicegroup=all&style=summary&nostatusheader [03:18:27] sorry, should have said "host group" not service group [03:18:35] because it's on the virtual host i'm adding [03:19:13] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?hostgroup=all&style=summary&nostatusheader [03:19:38] maybe i can just skip the group altogether [03:19:46] hosts not in group.. 
exists [03:20:52] RECOVERY - Disk space on search1019 is OK: DISK OK [03:21:21] <^d|voted> !log restarted lucene-search-2 on search1019: it'd been timing out for a few days and filled disk with log files. [03:21:32] Logged the message, Master [03:23:56] (03CR) 10Dzahn: add monitoring for search.wm (Apple dict bridge) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/171193 (owner: 10Dzahn) [03:24:35] (03PS2) 10Dzahn: add monitoring for search.wm (Apple dict bridge) [puppet] - 10https://gerrit.wikimedia.org/r/171193 [03:25:36] (03PS3) 10Dzahn: add monitoring for search.wm (Apple dict bridge) [puppet] - 10https://gerrit.wikimedia.org/r/171193 [03:26:21] (03CR) 10Dzahn: "@neon:~# /usr/lib/nagios/plugins/check_http -S -H search.wikimedia.org -u "https://search.wikimedia.org/?lang=en&site=wikipedia&search=Wik" [puppet] - 10https://gerrit.wikimedia.org/r/171193 (owner: 10Dzahn) [03:26:23] (03CR) 10Chad: [C: 031] add monitoring for search.wm (Apple dict bridge) [puppet] - 10https://gerrit.wikimedia.org/r/171193 (owner: 10Dzahn) [03:28:43] thanks Chad, cya later .. /away zzzz [03:28:55] <^d|voted> g'night [03:40:51] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2034: active_shards: 6097: relocating_shards: 12: initializing_shards: 1: unassigned_shards: 1 [03:40:51] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2034: active_shards: 6097: relocating_shards: 12: initializing_shards: 1: unassigned_shards: 1 [03:40:51] PROBLEM - ElasticSearch health check on elastic1023 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2034: active_shards: 6097: relocating_shards: 12: initializing_shards: 1: unassigned_shards: 1 [03:40:51] PROBLEM - ElasticSearch health check on elastic1019 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2034: active_shards: 6097: relocating_shards: 12: initializing_shards: 1: unassigned_shards: 1 [03:40:51] PROBLEM - ElasticSearch health check on elastic1029 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2034: active_shards: 6097: relocating_shards: 12: initializing_shards: 1: unassigned_shards: 1 [03:40:52] PROBLEM - ElasticSearch health check on elastic1028 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2034: active_shards: 6097: relocating_shards: 12: initializing_shards: 1: unassigned_shards: 1 [03:40:52] PROBLEM - ElasticSearch health check on elastic1024 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2034: active_shards: 6097: relocating_shards: 12: initializing_shards: 1: unassigned_shards: 1 [03:42:02] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2034: active_shards: 6097: relocating_shards: 12: initializing_shards: 0: unassigned_shards: 0 [03:42:03] RECOVERY - ElasticSearch health check on elastic1023 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2034: active_shards: 6097: relocating_shards: 12: initializing_shards: 0: unassigned_shards: 0 [03:42:03] RECOVERY - ElasticSearch health check on elastic1019 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2034: active_shards: 6097: relocating_shards: 12: initializing_shards: 0: unassigned_shards: 0 [03:42:03] RECOVERY - ElasticSearch health check on elastic1024 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2034: active_shards: 6097: relocating_shards: 12: initializing_shards: 0: unassigned_shards: 0 [03:42:03] RECOVERY - ElasticSearch health check on elastic1028 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2034: active_shards: 6097: relocating_shards: 12: initializing_shards: 0: unassigned_shards: 0 [03:42:03] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2034: active_shards: 6097: relocating_shards: 12: initializing_shards: 0: unassigned_shards: 0 [03:42:04] RECOVERY - ElasticSearch health check on elastic1029 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2034: active_shards: 6097: relocating_shards: 12: initializing_shards: 0: unassigned_shards: 0 [03:47:32] PROBLEM - Disk space on ocg1001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=72%): [03:54:51] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2029: active_shards: 6081: relocating_shards: 9: initializing_shards: 2: unassigned_shards: 1 [03:54:51] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2029: active_shards: 6081: relocating_shards: 9: initializing_shards: 2: unassigned_shards: 1 [03:55:52] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2030: active_shards: 6085: relocating_shards: 11: initializing_shards: 0: unassigned_shards: 0 [03:55:52] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2030: active_shards: 6085: relocating_shards: 11: initializing_shards: 0: unassigned_shards: 0 [04:11:43] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2029: active_shards: 6082: relocating_shards: 3: initializing_shards: 1: unassigned_shards: 1 [04:12:12] PROBLEM - ElasticSearch health check on elastic1025 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2028: active_shards: 6079: relocating_shards: 3: initializing_shards: 2: unassigned_shards: 2 [04:12:21] PROBLEM - ElasticSearch health check on elastic1017 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2030: active_shards: 6081: relocating_shards: 3: initializing_shards: 5: unassigned_shards: 1 [04:12:21] PROBLEM - ElasticSearch health check on elastic1021 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2030: active_shards: 6081: relocating_shards: 3: initializing_shards: 5: unassigned_shards: 1 [04:12:51] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2031: active_shards: 6088: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [04:13:32] RECOVERY - ElasticSearch health check on elastic1021 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2030: active_shards: 6085: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [04:14:21] RECOVERY - ElasticSearch health check on elastic1025 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2028: active_shards: 6079: relocating_shards: 5: initializing_shards: 0: unassigned_shards: 0 [04:14:22] PROBLEM - ElasticSearch health check on elastic1018 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2028: active_shards: 6079: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [04:14:22] PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2028: active_shards: 6079: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [04:14:22] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2028: active_shards: 6079: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [04:14:22] PROBLEM - ElasticSearch health check on elastic1024 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2028: active_shards: 6079: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [04:14:22] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2028: active_shards: 6079: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [04:14:23] PROBLEM - ElasticSearch health check on elastic1019 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2028: active_shards: 6079: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [04:14:23] PROBLEM - ElasticSearch health check on elastic1029 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2028: active_shards: 6079: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [04:14:23] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2028: active_shards: 6079: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [04:14:24] PROBLEM - ElasticSearch health check on elastic1023 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2028: active_shards: 6079: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [04:14:24] PROBLEM - ElasticSearch health check on elastic1028 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2028: active_shards: 6079: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [04:15:24] RECOVERY - ElasticSearch health check on elastic1018 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2029: active_shards: 6082: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [04:15:24] RECOVERY - ElasticSearch health check on elastic1007 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2029: active_shards: 6082: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [04:15:24] RECOVERY - ElasticSearch health check on elastic1024 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2029: active_shards: 6082: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [04:15:24] RECOVERY - ElasticSearch health check on elastic1023 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2029: active_shards: 6082: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [04:15:24] RECOVERY - ElasticSearch health check on elastic1029 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2029: active_shards: 6082: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [04:15:25] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2029: active_shards: 6082: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [04:15:25] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2029: active_shards: 6082: relocating_shards: 6: initializing_shards: 0: unassigned_shards: 0 [04:15:51] jgage: around? [04:17:31] RECOVERY - ElasticSearch health check on elastic1028 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2032: active_shards: 6091: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [04:17:32] RECOVERY - ElasticSearch health check on elastic1019 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2032: active_shards: 6091: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [04:17:32] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2032: active_shards: 6091: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [04:17:34] RECOVERY - ElasticSearch health check on elastic1017 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2032: active_shards: 6091: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [04:17:53] * ^d|voted twiddles thumbs [04:18:05] ah, you're around :) [04:19:30] (03CR) 10Ori.livneh: [C: 04-1] "Looks good, except that this should go in the role class. Where do you intend to apply it?" 
[puppet] - 10https://gerrit.wikimedia.org/r/171193 (owner: 10Dzahn) [04:32:31] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [04:34:25] <^d|voted> ori: What's up? [04:34:58] the 5xx? not sure, looking [04:35:24] seems to have stopped, some brief spike [04:35:41] <^d|voted> No, I thought you were wanting me for something when you said "oh you're around" [04:35:42] <^d|voted> :) [04:36:23] oh, i was wondering if the ElasticSearch stuff was under control [04:37:19] <^d|voted> Yeah. We're doing a rebuild of all indexes (I !logged it earlier) [04:37:30] oh! i missed that. [04:37:36] <^d|voted> Sometimes icinga will complain a bit as 1 or 2 shards end up uninitialized. [04:37:47] <^d|voted> (which is likely, since I'm expanding shards & replicas for the larger wikis) [04:38:27] (03CR) 10Glaisher: "Yeah, perhaps we need to close (maybe delete?) the wiki first before this is done." [puppet] - 10https://gerrit.wikimedia.org/r/170925 (owner: 10Glaisher) [04:38:34] neat. is it fun to work with? [04:38:39] elastic, that is, not icinga ;) [04:38:56] <^d|voted> it's a very nice system to work with. [04:39:09] <^d|voted> rest api is well-documented and just makes sense. [04:40:13] <^d|voted> They also have what they call a "cat api" which is the most brilliant read api ever :) [04:40:40] <^d|voted> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cat.html [04:41:53] * ^d|voted goes to find some ice cream [04:42:53] ah, that is kinda nice [04:43:15] really nice, actually [04:46:12] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [04:48:02] PROBLEM - ElasticSearch health check on elastic1029 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2034: active_shards: 6097: relocating_shards: 3: initializing_shards: 1: unassigned_shards: 1 [04:48:02] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2034: active_shards: 6097: relocating_shards: 3: initializing_shards: 1: unassigned_shards: 1 [04:48:02] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2034: active_shards: 6097: relocating_shards: 3: initializing_shards: 1: unassigned_shards: 1 [04:48:02] PROBLEM - ElasticSearch health check on elastic1024 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2034: active_shards: 6097: relocating_shards: 3: initializing_shards: 1: unassigned_shards: 1 [04:49:12] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2035: active_shards: 6100: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [04:49:12] RECOVERY - ElasticSearch health check on elastic1024 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2035: active_shards: 6100: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [04:49:12] RECOVERY - ElasticSearch health check on elastic1029 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2035: active_shards: 6100: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [04:49:12] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2035: active_shards: 6100: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [05:11:13] PROBLEM - Apache HTTP on mw1031 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.014 second response time [05:11:13] PROBLEM - Apache HTTP on mw1020 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.018 second response time [05:11:21] PROBLEM - HHVM rendering on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:11:31] PROBLEM - Apache HTTP on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:11:42] PROBLEM - Apache HTTP on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:12:41] PROBLEM - Apache HTTP on mw1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:12:43] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [05:12:51] PROBLEM - Apache HTTP on mw1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:13:01] PROBLEM - HHVM rendering on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:13:24] PROBLEM - HHVM rendering on mw1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:13:27] PROBLEM - HHVM rendering on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:13:27] PROBLEM - Apache HTTP on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:13:29] PROBLEM - HHVM rendering on mw1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 1.398 second response time [05:13:31] RECOVERY - Apache HTTP on mw1031 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 4.983 second response time [05:13:31] PROBLEM - Apache HTTP on mw1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:13:42] RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.095 second response time [05:13:51] PROBLEM - HHVM rendering on mw1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:13:56] PROBLEM - Apache HTTP on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:14:02] PROBLEM - HHVM rendering on mw1026 is CRITICAL: Connection timed out [05:14:02] PROBLEM - HHVM rendering on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:14:02] PROBLEM - Apache HTTP on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:14:12] RECOVERY - HHVM rendering on mw1023 is OK: HTTP OK: HTTP/1.1 200 OK - 67822 bytes in 0.952 second response time [05:14:22] PROBLEM - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:14:26] PROBLEM - HHVM rendering on mw1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:14:42] PROBLEM - Apache HTTP 
on mw1028 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 3.169 second response time [05:14:51] PROBLEM - HHVM rendering on mw1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:14:57] PROBLEM - HHVM rendering on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:15:01] PROBLEM - HHVM rendering on mw1025 is CRITICAL: Connection timed out [05:15:01] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:15:02] PROBLEM - HHVM rendering on mw1031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:15:11] PROBLEM - Apache HTTP on mw1017 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 2.574 second response time [05:15:12] RECOVERY - HHVM rendering on mw1017 is OK: HTTP OK: HTTP/1.1 200 OK - 67814 bytes in 0.379 second response time [05:15:16] yo [05:15:21] PROBLEM - Apache HTTP on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:15:35] PROBLEM - Apache HTTP on mw1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:15:36] PROBLEM - HHVM rendering on mw1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:15:40] Hey jgage [05:15:51] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.279 second response time [05:16:32] RECOVERY - HHVM rendering on mw1028 is OK: HTTP OK: HTTP/1.1 200 OK - 67814 bytes in 1.023 second response time [05:16:35] hmmm [05:16:51] PROBLEM - Apache HTTP on mw1031 is CRITICAL: Connection timed out [05:17:07] jgage, I was hitting 503s on MediaWiki.org (I have HHVM enabled) repeatedly, but it finally went through. [05:17:11] Doing a preview. [05:17:12] PROBLEM - Apache HTTP on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:17:12] RECOVERY - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 67814 bytes in 0.492 second response time [05:17:23] PROBLEM - HHVM rendering on mw1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:17:33] RECOVERY - Apache HTTP on mw1029 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.069 second response time [05:17:49] thanks superm401. not sure what the cause was, but at least it's hhvm-specific [05:17:55] * jgage pokes around [05:17:57] PROBLEM - HHVM rendering on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:17:58] RECOVERY - Apache HTTP on mw1028 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.097 second response time [05:18:23] RECOVERY - HHVM rendering on mw1020 is OK: HTTP OK: HTTP/1.1 200 OK - 67814 bytes in 0.420 second response time [05:18:31] I don't actually know that, I just know the problem happened with HHVM, not that it works fine for everyone else. 
:) [05:18:32] PROBLEM - HHVM rendering on mw1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:18:35] PROBLEM - HHVM rendering on mw1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:19:05] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.222 second response time [05:19:14] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:19:20] yeah aside from hhvm we look ok [05:19:43] RECOVERY - HHVM rendering on mw1017 is OK: HTTP OK: HTTP/1.1 200 OK - 67814 bytes in 0.625 second response time [05:20:04] PROBLEM - HHVM rendering on mw1028 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 4.398 second response time [05:20:13] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.124 second response time [05:20:55] PROBLEM - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is CRITICAL: Connection timed out [05:21:04] PROBLEM - Apache HTTP on mw1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:21:34] RECOVERY - Apache HTTP on mw1031 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 3.093 second response time [05:21:53] PROBLEM - Apache HTTP on mw1025 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 0.043 second response time [05:21:54] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.452 second response time [05:22:23] RECOVERY - HHVM rendering on mw1023 is OK: HTTP OK: HTTP/1.1 200 OK - 67822 bytes in 0.776 second response time [05:22:35] PROBLEM - Apache HTTP on mw1020 is CRITICAL: Connection timed out [05:22:43] RECOVERY - HHVM rendering on mw1025 is OK: HTTP OK: HTTP/1.1 200 OK - 67814 bytes in 0.759 second response time [05:22:54] RECOVERY - HHVM rendering on mw1031 is OK: HTTP OK: HTTP/1.1 200 OK - 67822 bytes in 0.470 second response time [05:23:04] PROBLEM - HHVM rendering on mw1020 is CRITICAL: Connection timed out [05:23:13] PROBLEM - HHVM rendering on mw1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:23:26] RECOVERY - HHVM rendering on mw1030 is OK: HTTP OK: HTTP/1.1 200 OK - 67822 bytes in 0.251 second response time [05:23:33] RECOVERY - HHVM rendering on mw1018 is OK: HTTP OK: HTTP/1.1 200 OK - 67814 bytes in 0.334 second response time [05:23:34] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.086 second response time [05:23:44] RECOVERY - HHVM rendering on mw1019 is OK: HTTP OK: HTTP/1.1 200 OK - 67814 bytes in 9.787 second response time [05:23:55] RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.161 second response time [05:23:55] RECOVERY - Apache HTTP on mw1030 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.099 second response time [05:24:13] RECOVERY - HHVM rendering on mw1017 is OK: HTTP OK: HTTP/1.1 200 OK - 67814 bytes in 0.450 second response time [05:24:45] damn. [05:24:49] i'll roll back to 5%. 
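For context on the rollback just above: the change being reverted is a one-line setting in wmf-config/CommonSettings.php (the "5 => 10" commit earlier this evening, and the "5%" commit that follows just below). A minimal sketch of the line being toggled; the surrounding logic that actually routes that share of traffic to the HHVM app servers isn't shown in the log and is assumed to live elsewhere in wmf-config:

    // wmf-config/CommonSettings.php (sketch, not the real file layout)
    // Share of eligible traffic sent to the HHVM app servers.
    // Raised from 5 to 10 earlier tonight to test T820, rolled back to 5
    // after the crashes above.
    $wgPercentHHVM = 5;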
[05:24:56] thanks [05:25:04] RECOVERY - Apache HTTP on mw1025 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.062 second response time [05:25:07] i tried restarting a couple, but that doesn't seem like a real solution [05:25:13] RECOVERY - LVS HTTP IPv4 on hhvm-appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 67814 bytes in 0.267 second response time [05:25:17] (03PS1) 10Ori.livneh: $wgPercentHHVM: 5% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171198 [05:25:29] (03CR) 10Ori.livneh: [C: 032] $wgPercentHHVM: 5% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171198 (owner: 10Ori.livneh) [05:25:36] (03Merged) 10jenkins-bot: $wgPercentHHVM: 5% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171198 (owner: 10Ori.livneh) [05:25:36] RECOVERY - HHVM rendering on mw1028 is OK: HTTP OK: HTTP/1.1 200 OK - 67814 bytes in 0.277 second response time [05:26:16] !log ori Synchronized wmf-config/CommonSettings.php: $wgPercentHHVM: back to 5% (duration: 00m 11s) [05:26:24] could take up to five minutes to be effective. [05:26:25] Logged the message, Master [05:26:37] ok [05:32:48] PROBLEM - Apache HTTP on mw1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:32:56] PROBLEM - HHVM rendering on mw1017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [05:39:46] PROBLEM - ElasticSearch health check on elastic1025 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2040: active_shards: 6122: relocating_shards: 7: initializing_shards: 1: unassigned_shards: 1 [05:40:46] RECOVERY - ElasticSearch health check on elastic1025 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2041: active_shards: 6125: relocating_shards: 5: initializing_shards: 0: unassigned_shards: 0 [05:42:34] i blame solar flares [05:42:36] RECOVERY - Disk space on ocg1001 is OK: DISK OK [05:44:36] RECOVERY - HHVM rendering on mw1020 is OK: HTTP OK: HTTP/1.1 200 OK - 67814 bytes in 0.208 second response time [05:45:07] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.048 second response time [05:45:11] ori: ^ i restarted hhvm on mw1020. it was doing nothing at all. others are the same [05:45:27] do we need to manually restart them all? [05:45:51] can we salt something? 
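The fleet-wide restart being asked about here is what ori !logs a few minutes further down; a sketch of that invocation, run from the salt master and targeting minions by the php:hhvm grain:

    # Restart the HHVM Upstart job on every app server carrying the php:hhvm grain.
    salt -G 'php:hhvm' cmd.run 'restart hhvm'

    # Optional follow-up, mirroring the "HHVM processes" icinga check: each host
    # should report exactly one hhvm process.
    salt -G 'php:hhvm' cmd.run 'pgrep -c hhvm'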
[05:46:05] i did a couple, but lacking deeper understanding i figured the root cause would trigger again [05:46:27] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [05:46:43] might be useful to have at least one still locked up, for gdb or whatnot [05:47:08] traps: hhvm[13572] general protection ip:7f5d2282fced sp:7f5cdb7fa4c0 error:0 in libjemalloc.so.1[7f5d22810000+45000] [05:48:20] i have already restarted hhvm on mw1017 which recovered and then went critical again [05:48:30] nice [05:49:52] i think they'll recover shortly [05:53:17] RECOVERY - Apache HTTP on mw1022 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.066 second response time [05:53:26] RECOVERY - HHVM rendering on mw1024 is OK: HTTP OK: HTTP/1.1 200 OK - 67815 bytes in 4.583 second response time [05:53:27] !log ran: salt -G php:hhvm cmd.run 'restart hhvm' [05:53:29] RECOVERY - HHVM rendering on mw1027 is OK: HTTP OK: HTTP/1.1 200 OK - 67814 bytes in 0.205 second response time [05:53:36] RECOVERY - Apache HTTP on mw1026 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.055 second response time [05:53:36] RECOVERY - HHVM rendering on mw1022 is OK: HTTP OK: HTTP/1.1 200 OK - 67814 bytes in 0.191 second response time [05:53:38] Logged the message, Master [05:53:46] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.053 second response time [05:53:57] RECOVERY - Apache HTTP on mw1017 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.047 second response time [05:54:00] RECOVERY - HHVM rendering on mw1053 is OK: HTTP OK: HTTP/1.1 200 OK - 67814 bytes in 0.190 second response time [05:54:00] RECOVERY - HHVM rendering on mw1017 is OK: HTTP OK: HTTP/1.1 200 OK - 67814 bytes in 0.213 second response time [05:54:00] RECOVERY - HHVM rendering on mw1029 is OK: HTTP OK: HTTP/1.1 200 OK - 67814 bytes in 0.629 second response time [05:54:06] RECOVERY - Apache HTTP on mw1021 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.054 second response time [05:54:06] RECOVERY - HHVM rendering on mw1021 is OK: HTTP OK: HTTP/1.1 200 OK - 67814 bytes in 0.170 second response time [05:54:06] RECOVERY - HHVM rendering on mw1026 is OK: HTTP OK: HTTP/1.1 200 OK - 67814 bytes in 0.183 second response time [05:54:07] RECOVERY - Apache HTTP on mw1027 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.054 second response time [05:54:16] RECOVERY - Apache HTTP on mw1029 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.064 second response time [05:54:16] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.061 second response time [05:56:17] i haven't been able to reproduce this crash [05:56:43] but in the future if i want to try out a fix, i'll increase the weight of one server in pybal [05:57:56] it only occurred to me that i could do that while halfway through typing ", but there's no way to increase the load selectively on just one server" [06:02:38] ori: is hhvm gdb friendly with symbols? would it help if we grabbed a core file or "thread apply all bt"? [06:03:25] springle: yes, i have core dumps, and backtraces. 
looking at them with bsimmers @ #mediawiki-core [06:03:45] nice [06:04:00] it looks like a double-free stemming from the fix for the memory leak: https://dpaste.de/rATL/raw [06:04:00] figured it was probably being handled after i typed the above :) [06:04:47] though we're now at fix #2, after a do-over to fix this crash [06:07:12] it stems from the DOMDocument extension code, which is exceptionally awful [06:07:48] <_joe_> ori: hey [06:08:01] <_joe_> what has happened? [06:08:41] _joe_: hey. tl;dr: got fix #2 from bsimmers; applied on beta / mw1017; looked good; applied elsewhere; looked good; increased $wgPercentHHVM from 5 to 10, looked good for a few hours, then this ^. [06:08:58] _joe_: went back to 5, restarted the fleet. [06:09:06] PROBLEM - ElasticSearch health check on elastic1031 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6161: relocating_shards: 6: initializing_shards: 1: unassigned_shards: 2 [06:09:11] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6161: relocating_shards: 6: initializing_shards: 1: unassigned_shards: 2 [06:09:27] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6161: relocating_shards: 7: initializing_shards: 1: unassigned_shards: 2 [06:09:27] PROBLEM - ElasticSearch health check on elastic1009 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6161: relocating_shards: 7: initializing_shards: 1: unassigned_shards: 2 [06:09:28] PROBLEM - ElasticSearch health check on elastic1030 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6161: relocating_shards: 7: initializing_shards: 1: unassigned_shards: 2 [06:09:28] PROBLEM - ElasticSearch health check on elastic1029 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6161: relocating_shards: 7: initializing_shards: 1: unassigned_shards: 2 [06:09:28] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6161: relocating_shards: 7: initializing_shards: 1: unassigned_shards: 2 [06:09:28] PROBLEM - ElasticSearch health check on elastic1025 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6161: relocating_shards: 7: initializing_shards: 1: unassigned_shards: 2 [06:09:28] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6161: relocating_shards: 7: initializing_shards: 1: unassigned_shards: 2 [06:09:29] PROBLEM - ElasticSearch health check on elastic1019 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6161: relocating_shards: 7: initializing_shards: 1: unassigned_shards: 2 [06:09:40] the elastic stuff is ^d|voted moving stuff around [06:09:51] i don't think he has access to icinga-admin, so i'm not sure he can ack or mute the alerts [06:10:03] <_joe_> ok np [06:10:12] <_joe_> so another round another crash [06:10:19] <_joe_> again a double free? [06:10:25] yes [06:10:41] <_joe_> :/ [06:10:43] we thought we had the crash isolated, because we could reproduce it [06:11:02] bsimmer's patch fixed the reproduction case, but apparently left a bug elsewhere [06:12:26] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6161: relocating_shards: 10: initializing_shards: 1: unassigned_shards: 2 [06:12:35] <^d|voted> ori: Nope, I can't. [06:12:56] the memory model of the DOMDocument extension is that it's the document's responsibility to free its DOM nodes when it is freed, including DOM nodes that have been detached from it [06:12:57] <_joe_> ori: damn,; well, good work anyways [06:13:36] <_joe_> did you submit the patch to our repo? [06:13:52] <_joe_> (just to know if I need to move things around) [06:14:01] (copied from #wikimedia-qa) hey everybody, seems like beta labs has gone south, everything is 503 error, even load.php and http://en.wikipedia.beta.wmflabs.org/w/api.php [06:14:06] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6161: relocating_shards: 9: initializing_shards: 1: unassigned_shards: 2 [06:14:07] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6161: relocating_shards: 9: initializing_shards: 1: unassigned_shards: 2 [06:14:07] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6161: relocating_shards: 9: initializing_shards: 1: unassigned_shards: 2 [06:14:52] spagewmf: restarted hhvm on the beta app servers [06:15:09] if the DOM node gets attached to a different document, it's supposed to be remove from the roster of orphan nodes of its original parent document [06:15:46] ori, thanks. just now? I was about to paste in the "deployment-mediawiki02 hhvm: Failed to initialize central HHBC repository:#012 ..." line. 
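A small PHP illustration of the ownership rule ori describes above: a node created by one DOMDocument stays that document's responsibility even after being detached, and only stops being its problem once another document takes it over. This is just a sketch of the user-visible behaviour that exercises that bookkeeping, not the HHVM extension internals being patched:

    <?php
    $a = new DOMDocument();
    $b = new DOMDocument();

    // $node is created by $a, so $a must eventually free it, attached or not.
    $node = $a->createElement('p', 'hello');
    $a->appendChild($node);
    $a->removeChild($node);        // now a detached "orphan" that $a still tracks

    // importNode() gives $b its own copy; until appendChild() runs, that copy is
    // an orphan of $b -- the kind of cross-document bookkeeping discussed above.
    $copy = $b->importNode($node, true);
    $b->appendChild($copy);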
[06:16:28] giving rise to functions with names like 'appendOrphanIfNeeded' [06:16:31] spagewmf: yes, just now [06:17:57] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6161: relocating_shards: 3: initializing_shards: 2: unassigned_shards: 1 [06:21:16] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6162: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [06:21:36] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6162: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0 [06:21:36] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6162: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0 [06:21:36] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6162: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0 [06:21:46] RECOVERY - ElasticSearch health check on elastic1031 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6162: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0 [06:21:46] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6162: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0 [06:21:57] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6162: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [06:22:07] RECOVERY - ElasticSearch health check on elastic1029 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6162: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [06:22:07] RECOVERY - ElasticSearch health check on elastic1025 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6162: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [06:22:08] RECOVERY - ElasticSearch health check on elastic1030 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6162: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [06:22:08] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6162: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [06:22:08] RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6162: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [06:22:08] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6162: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [06:22:08] RECOVERY - ElasticSearch health check on elastic1019 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6162: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [06:22:08] RECOVERY - ElasticSearch health check on elastic1009 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6162: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [06:26:26] RECOVERY - Disk space on ocg1002 is OK: DISK OK [06:26:27] RECOVERY - Disk space on ocg1003 is OK: DISK OK [06:28:37] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:40] (03PS5) 10Springle: dbproxy1002 haproxy monitoring [puppet] - 10https://gerrit.wikimedia.org/r/170663 [06:28:47] PROBLEM - puppet last run on db1015 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:56] (03PS6) 10Springle: dbproxy1002 haproxy monitoring [puppet] - 10https://gerrit.wikimedia.org/r/170663 [06:29:09] !log depool wtp1013, wtp1014, wtp1015, wtp1016, wtp1023 for trusty reinstallation [06:29:18] Logged the message, Master [06:29:48] PROBLEM - puppet last run on dbproxy1001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:07] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:08] PROBLEM - puppet last run on db1042 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:17] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 2 failures [06:30:19] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:28] PROBLEM - puppet last run on lvs2004 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:35] (03CR) 10Springle: [C: 032] dbproxy1002 haproxy monitoring [puppet] - 10https://gerrit.wikimedia.org/r/170663 (owner: 10Springle) [06:30:37] PROBLEM - puppet last run on virt1006 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:38] PROBLEM - puppet last run on mw1114 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:38] PROBLEM - puppet last run on db1051 is CRITICAL: CRITICAL: Puppet has 1 failures [06:30:48] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 2 failures 
[06:31:12] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:17] PROBLEM - DPKG on amssq57 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [06:31:27] PROBLEM - puppet last run on cp4014 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:37] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:48] PROBLEM - puppet last run on db1034 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:57] PROBLEM - puppet last run on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:32:33] ...and that's puppetmaster o'clock [06:34:00] PROBLEM - check if salt-minion is running on wtp1023 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [06:34:19] PROBLEM - check if salt-minion is running on wtp1013 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [06:34:28] PROBLEM - check if salt-minion is running on wtp1014 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [06:34:48] PROBLEM - check if salt-minion is running on wtp1015 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [06:34:59] PROBLEM - check if salt-minion is running on wtp1016 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [06:36:36] _joe_: wmf-reimage rulez [06:37:04] <_joe_> akosiaris: I have 200 servers to reimage [06:37:21] <_joe_> can you imagine _that_ without some basic automation? [06:37:27] I only got 24 :) [06:37:28] PROBLEM - puppet last run on ssl3001 is CRITICAL: CRITICAL: Puppet has 1 failures [06:37:51] it is still in the range of repetitive mindless task but doable [06:38:11] RECOVERY - DPKG on amssq57 is OK: All packages OK [06:40:13] http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&h=wtp1015.eqiad.wmnet&m=cpu_report&s=by+name&mc=2&g=cpu_report&c=Parsoid+eqiad [06:40:25] spot the long running parsoid process [06:43:28] PROBLEM - haproxy alive on dbproxy1002 is CRITICAL: CRITICAL check_alive invalid response [06:44:17] RECOVERY - puppet last run on searchidx1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:45:28] PROBLEM - CI: Puppet failure events on labmon1001 is CRITICAL: CRITICAL: integration.integration-slave1003.puppetagent.failed_events.value (33.33%) [06:46:02] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [06:46:05] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [06:46:24] RECOVERY - puppet last run on db1015 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:46:33] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:33] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:42] RECOVERY - puppet last run on db1042 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [06:46:42] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:43] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [06:46:53] RECOVERY - puppet last run on lvs2004 is OK: OK: Puppet is currently enabled, last run 
57 seconds ago with 0 failures [06:46:53] RECOVERY - puppet last run on virt1006 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [06:46:53] RECOVERY - puppet last run on db1034 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [06:47:03] RECOVERY - puppet last run on db1051 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [06:47:23] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [06:47:24] RECOVERY - puppet last run on dbproxy1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:47:34] RECOVERY - puppet last run on cp4014 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [06:48:02] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [06:52:37] (03CR) 10Yuvipanda: [C: 04-1] "dev_environ and exec_environ are both included on -dev and -login, so this would cause puppet conflict there. Should be removed from dev_e" [puppet] - 10https://gerrit.wikimedia.org/r/171192 (owner: 10Dzahn) [06:55:21] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 2 failures [06:55:34] RECOVERY - puppet last run on ssl3001 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [06:56:26] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Puppet has 2 failures [06:59:31] PROBLEM - Apache HTTP on mw1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:59:34] PROBLEM - HHVM rendering on mw1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:00:14] PROBLEM - check_puppetrun on lutetium is CRITICAL: CRITICAL: Puppet has 2 failures [07:00:32] 503ing again [07:00:52] RECOVERY - haproxy alive on dbproxy1002 is OK: OK check_alive uptime 92962s [07:01:45] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [07:01:45] PROBLEM - Apache HTTP on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:01:53] PROBLEM - HHVM rendering on mw1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:01:53] PROBLEM - Apache HTTP on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:02:32] PROBLEM - HHVM rendering on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:02:53] PROBLEM - HHVM rendering on mw1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:03:02] PROBLEM - Apache HTTP on mw1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:03:57] PROBLEM - HHVM rendering on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:04:02] PROBLEM - Apache HTTP on mw1053 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:04:03] ugh [07:04:11] ori, around? 
[07:04:19] yes, i'll restart them again [07:04:25] and revert the package [07:04:32] PROBLEM - Apache HTTP on mw1030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:05:02] RECOVERY - Apache HTTP on mw1029 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.063 second response time [07:05:19] RECOVERY - check_puppetrun on lutetium is OK: OK: Puppet is currently enabled, last run 173 seconds ago with 0 failures [07:05:31] PROBLEM - HHVM rendering on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:05:43] PROBLEM - Apache HTTP on mw1019 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 1.912 second response time [07:05:51] RECOVERY - Apache HTTP on mw1025 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.056 second response time [07:06:00] RECOVERY - HHVM rendering on mw1027 is OK: HTTP OK: HTTP/1.1 200 OK - 68143 bytes in 0.243 second response time [07:06:05] RECOVERY - Apache HTTP on mw1030 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 9.346 second response time [07:06:12] RECOVERY - HHVM rendering on mw1025 is OK: HTTP OK: HTTP/1.1 200 OK - 68143 bytes in 0.237 second response time [07:06:21] RECOVERY - Apache HTTP on mw1021 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.065 second response time [07:06:22] RECOVERY - HHVM rendering on mw1053 is OK: HTTP OK: HTTP/1.1 200 OK - 68143 bytes in 0.247 second response time [07:06:30] RECOVERY - Apache HTTP on mw1027 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.055 second response time [07:06:31] PROBLEM - Apache HTTP on mw1024 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 50350 bytes in 2.936 second response time [07:06:31] RECOVERY - Apache HTTP on mw1053 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.061 second response time [07:06:50] RECOVERY - HHVM rendering on mw1019 is OK: HTTP OK: HTTP/1.1 200 OK - 68143 bytes in 0.222 second response time [07:07:00] RECOVERY - HHVM rendering on mw1021 is OK: HTTP OK: HTTP/1.1 200 OK - 68143 bytes in 0.176 second response time [07:07:01] RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.055 second response time [07:07:40] RECOVERY - Apache HTTP on mw1024 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.060 second response time [07:07:51] RECOVERY - HHVM rendering on mw1029 is OK: HTTP OK: HTTP/1.1 200 OK - 68143 bytes in 0.259 second response time [07:09:24] PROBLEM - HHVM rendering on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:10:01] PROBLEM - Apache HTTP on mw1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:10:01] PROBLEM - Apache HTTP on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:10:01] PROBLEM - HHVM rendering on mw1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [07:10:50] RECOVERY - Apache HTTP on mw1028 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.057 second response time [07:10:51] RECOVERY - Apache HTTP on mw1027 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.057 second response time [07:10:51] RECOVERY - HHVM rendering on mw1028 is OK: HTTP OK: HTTP/1.1 200 OK - 68143 bytes in 0.343 second response time [07:11:10] RECOVERY - CI: Puppet failure events on labmon1001 is OK: OK: All targets OK [07:11:19] RECOVERY - HHVM rendering on mw1027 is OK: HTTP OK: HTTP/1.1 200 OK - 68143 bytes in 0.146 second response time [07:11:53] (03PS1) 10Yuvipanda: tools: Specify 
newer package name for libboost-python-dev [puppet] - 10https://gerrit.wikimedia.org/r/171202 [07:13:08] (03PS2) 10Yuvipanda: tools: Specify newer package name for libboost-python-dev [puppet] - 10https://gerrit.wikimedia.org/r/171202 [07:14:00] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [07:14:37] (03PS3) 10Yuvipanda: tools: Specify newer package name for libboost-python-dev [puppet] - 10https://gerrit.wikimedia.org/r/171202 [07:15:15] (03CR) 10Yuvipanda: [C: 032] tools: Specify newer package name for libboost-python-dev [puppet] - 10https://gerrit.wikimedia.org/r/171202 (owner: 10Yuvipanda) [07:18:14] !log rolled back cluster:appserver_hhvm to version 3.3.0-20140925+wmf3 of hhvm package [07:18:19] Logged the message, Master [07:19:31] (03PS1) 10Yuvipanda: tools: Use ensure => latest for all packages in dev_environ [puppet] - 10https://gerrit.wikimedia.org/r/171203 [07:21:29] (03CR) 10Yuvipanda: [C: 032] tools: Use ensure => latest for all packages in dev_environ [puppet] - 10https://gerrit.wikimedia.org/r/171203 (owner: 10Yuvipanda) [07:21:31] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [07:22:07] thanks for the quick fix, ori [07:22:32] (03PS1) 10Yuvipanda: tools: Don't try to install libvips on trusty hosts yet [puppet] - 10https://gerrit.wikimedia.org/r/171204 [07:23:03] (03CR) 10Yuvipanda: [C: 032] tools: Don't try to install libvips on trusty hosts yet [puppet] - 10https://gerrit.wikimedia.org/r/171204 (owner: 10Yuvipanda) [07:27:47] Eloquence: thanks for the ping [07:28:37] and finally, no more puppet errors on tools trusty. [07:31:37] * _joe_ here now [07:32:14] (03CR) 10Alexandros Kosiaris: [C: 032] servermon: Execute make_updates every hour [puppet] - 10https://gerrit.wikimedia.org/r/170934 (owner: 10Alexandros Kosiaris) [07:32:45] (03CR) 10Alexandros Kosiaris: [C: 032] Make wtp a ganglia aggregator [puppet] - 10https://gerrit.wikimedia.org/r/170954 (owner: 10Alexandros Kosiaris) [07:37:35] (03PS1) 10Ori.livneh: hhvm: set kernel.core_pattern sysctl param [puppet] - 10https://gerrit.wikimedia.org/r/171206 [07:37:45] ^ _joe_ [07:38:15] (03CR) 10jenkins-bot: [V: 04-1] hhvm: set kernel.core_pattern sysctl param [puppet] - 10https://gerrit.wikimedia.org/r/171206 (owner: 10Ori.livneh) [07:38:36] spurious equal sign is spurious [07:38:55] and on second thought, we really ought to make this the default everywhere [07:40:02] <_joe_> ori: I prefer.core at the end [07:40:15] <_joe_> it makes me remember the good ol times of digital unix [07:40:41] <_joe_> when men were men and microkernels were 600 MB [07:40:53] (03PS1) 10Springle: haproxy stats socket level user 666 [puppet] - 10https://gerrit.wikimedia.org/r/171207 [07:42:19] (03CR) 10Springle: [C: 032] haproxy stats socket level user 666 [puppet] - 10https://gerrit.wikimedia.org/r/171207 (owner: 10Springle) [07:42:30] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 5 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [07:42:45] (03PS2) 10Ori.livneh: base: set kernel.core_pattern sysctl param [puppet] - 10https://gerrit.wikimedia.org/r/171206 [07:43:03] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 5 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [07:43:03] akosiaris: ok to puppet-merge? [07:43:56] _joe_: is there a more principled reason? [07:44:13] springle: damn... I forgot it.. 
thanks [07:44:17] core.* makes the files appear beside one another when listing files [07:44:18] <_joe_> ori: no just habits [07:44:21] PROBLEM - Parsoid on wtp1013 is CRITICAL: Connection refused [07:44:34] akosiaris: np at all. done [07:44:49] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [07:45:00] <_joe_> well, ls *.core usually does the trick, but my comment was just nostalgic, I don't want to do bikeshedding; go on! [07:45:18] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [07:45:39] PROBLEM - puppet last run on wtp1015 is CRITICAL: CRITICAL: Puppet has 1 failures [07:46:01] PROBLEM - puppet last run on wtp1013 is CRITICAL: CRITICAL: Puppet has 2 failures [07:46:50] _joe_: i don't mind, just want to make sure this is right. https://github.com/search?q=kernel.core_pattern&type=Code&utf8=%E2%9C%93 shows more results for core.* than *.core, at a glance [07:47:32] but it's a sysctl parameter for the whole cluster, i'm not going to merge it myself [07:47:40] <_joe_> ori: sorry it's early morning here and I am wasting your night hour [07:47:49] <_joe_> oh you put it into base? [07:48:11] yes, i thought about it and i think it's worth the extra investment of "hmmm" time [07:48:20] there's really no reason to confine it to the hhvm module [07:48:20] * _joe_ nods [07:48:35] if we had existing conflicting patterns, that would be a different matter, but it looks like we don't set this anywhere [07:48:42] <_joe_> ori: I will probably experiment a little with hhvm restarts in beta [07:49:10] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 13 data above and 9 below the confidence bounds [07:49:11] ok; try to remember to log it in #wikimedia-labs [07:49:14] <_joe_> I'm probably going to reinvent the upstart script, more or less; I will preserve all the logic though [07:49:19] <_joe_> I will [07:49:28] please don't kill the upstart script [07:49:33] what exactly do you want to do? [07:50:00] i don't see a problem with the current restarts, other than the fact that we need to have them [07:50:01] <_joe_> ori: when you issue a reload/restart, make the new instance start and let it kill the old one [07:50:20] <_joe_> ori: they are causing a lot of 503s after scap in beta, it seems [07:50:26] well, the socket would be bound until the other one was killed [07:50:26] <_joe_> people got to me complaining [07:50:36] <_joe_> yes I did some extensive tests [07:50:38] you can just set the kill timeout to 0 [07:50:41] <_joe_> see my mail to ops@ [07:50:44] you don't need to redo the whole upstart file for that [07:50:48] <_joe_> no no [07:50:50] RECOVERY - puppet last run on wtp1015 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [07:50:55] <_joe_> I don't want to _redo_ it [07:51:01] <_joe_> just change the restart logic [07:52:20] RECOVERY - puppet last run on wtp1013 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [07:52:28] <_joe_> so, I have done some extensive research on how to do restarts as gracefully as possible, and I'm pretty convinced I have a good solution. [07:52:59] RECOVERY - Parsoid on wtp1013 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.018 second response time [07:53:14] <_joe_> not optimal, as the optimal one works only internally at facebook, where they still use the http server [07:53:25] what is it? 
i just read your e-mail [07:54:39] as the upstart script notes: [07:54:41] # When `hhvm.server.graceful_shutdown_wait` is set to a positive [07:54:41] # integer, HHVM will perform a graceful shutdown on SIGHUP. [07:54:41] kill signal HUP [07:54:46] <_joe_> ori: so you start the second instance, it goes into a loop trying to bind to the socket, then sends a /stop signal to the other instance and takes over; in the http libevent based server, you could make it take over the socket; in the new version you cant' [07:55:09] <_joe_> ori: except it does not do that properly [07:55:44] what is improper? [07:56:25] <_joe_> ori: kill -HUP will shut down hhvm; then it's restarted by upstart; there is quite the gap in between [07:57:23] <_joe_> ori: the way restarts are more harmless, in my tests, is: start a second instance, send /stop to the first [07:57:53] <_joe_> (HHVM does that internally, but it's broken now, I have a patch to make it work) [07:57:57] what you're suggesting sounds sophisticated, but i think it's too sophisticated. the bigger problem, imo, is the one identified in https://bugzilla.wikimedia.org/show_bug.cgi?id=71212 [07:58:33] we need to depool a server for restarts, not just because of the need to drain connections gracefully, but also because HHVM is a jit and needs a few seconds to warm up [07:58:35] <_joe_> ori: what I suggest is not sophisticated; is what facebook engineers baked into hhvm itself [07:59:18] <_joe_> ori: and you know I agree; but can you try to let me do my job as well? thanks a lot. [07:59:37] oh, interesting [07:59:47] now i see why you're drawing a connection to TakeoverFilename [07:59:55] <_joe_> yes but [08:00:03] <_joe_> TakeoverFilename doesn't work :P [08:00:07] right, https://github.com/facebook/hhvm/issues/4129 [08:00:10] <_joe_> it worked with libevent [08:00:22] <_joe_> no my issue is somewhat misleading as well [08:00:31] <_joe_> the /stop doesn't work because of my issue [08:00:50] <_joe_> but the takeover mechanism doesn't work because in the fastcgi server it's not implemented [08:01:23] <_joe_> still, restarting that way causes less hiccups for a series of reasons [08:01:30] <_joe_> that any other restart method I tried [08:01:38] so, i'm not caving in to the dramaz above, honestly [08:01:43] but i'm starting to think that you're right [08:01:43] <_joe_> I have to see what happens under major real load [08:01:59] but how do you intend to use this functionality on beta without a fix for the bug? [08:02:11] <_joe_> I have the fix [08:02:17] ! [08:02:23] <_joe_> I am about to add a small patch to our package [08:02:28] <_joe_> basically, I am cheating :) [08:02:40] <_joe_> I add a parameter AdminServer.HttpPort [08:02:41] is it a matter of copy-pasting some code from the http server? [08:02:51] <_joe_> no not for the takeover [08:03:14] PROBLEM - Parsoid on wtp1023 is CRITICAL: Connection refused [08:03:16] <_joe_> that, will need a complete reimplementation I guess, bblack and I were looking into it yesterday [08:03:30] <_joe_> but at least make the /stop call work [08:04:10] okay, i'm sorry for giving you a hard time about this, this is really good [08:04:35] PROBLEM - puppet last run on wtp1023 is CRITICAL: CRITICAL: Puppet has 2 failures [08:05:02] <_joe_> np, I overreacted [08:05:14] if i may, though: the simplest solution for now may be to just remove the three lines or so from scap that do the hhvm restart. 
they were added to solve a very optimistic problem (jit cache filling up eventually after a long uptime) [08:05:24] but we're not facing a long uptime problem, sadly [08:05:37] <_joe_> ori: not at all :P [08:05:57] but opening up the http port and making /stop sounds tolerable [08:06:00] while we wait for a proper fix [08:06:05] assuming it works cleanly with that [08:06:22] <_joe_> I am not sure in prod it will work as well as in beta [08:06:33] <_joe_> because I think we have different balancing strategies [08:10:42] RECOVERY - Parsoid on wtp1023 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.031 second response time [08:10:53] RECOVERY - puppet last run on wtp1023 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [08:11:05] <_joe_> btw I do agree that draining will be the way to go at the end [08:11:05] what about having two instances locally load balanced by apache [08:11:16] <_joe_> ori: I thought about that [08:11:18] bound to different ports [08:11:46] <_joe_> but apache can be horrible as a proxy [08:11:59] well, we're stuck with it as a proxy for independent reasons [08:12:05] <_joe_> and that introduces a lot of headaches as well [08:12:32] <_joe_> I mean all operations become more complicated, and you are basically moving the draining logic on the machine [08:12:45] <_joe_> instead than on the load balancer [08:12:52] (03PS1) 10Alexandros Kosiaris: openstreetmap: Split expired tile list files [puppet] - 10https://gerrit.wikimedia.org/r/171211 [08:13:17] <_joe_> so, that's why I discarded that idea [08:13:28] <_joe_> we have a good balancer that works well, pybal [08:13:47] <_joe_> why should we introduce another that we don't like and with less features? [08:13:51] right, so mark has the idleconnection.py thing [08:14:14] can we make it maintain idle connections to several ports? or just to the fcgi port, for that matter? [08:14:30] <_joe_> to what end? [08:14:32] the fcgi port is not open to pybal, but that can be fixed [08:14:59] <_joe_> the time a restart takes is shorter than the mean time between pybal checks [08:15:19] the idle connection thing is instant, no? [08:15:31] yeah, it's event-driven [08:15:38] <_joe_> mmmh I don't know, I have to check [08:15:43] it is, i looked [08:15:44] <_joe_> ah! 
you're right [08:15:56] the problem now is that hhvm could be in an awful state, but pybal doesn't know, because the connection to apache is alive [08:15:57] PROBLEM - NTP on wtp1023 is CRITICAL: NTP CRITICAL: Offset unknown [08:15:58] <_joe_> now _this_ is a good idea [08:16:19] if it connects to the fcgi port, then when hhvm stops or crashes, pybal reacts instantly [08:16:20] <_joe_> ori: pybal does check a rendered page as well [08:16:28] yes, but that is periodic [08:16:30] <_joe_> you can have more checks for each services [08:17:17] <_joe_> yeah well, the "hhvm is responding 500s" state will take a few seconds to be detected [08:17:25] <_joe_> not that it's such a big deal imo [08:17:59] <_joe_> the "keep it offline if fastcgi is down" may work [08:18:12] <_joe_> but it depends on how hhvm handles timeouts of connections [08:18:25] not really, because when people say "hhvm is responding 500s", they usually mean "apache is giving 503s because the backend isn't responding to its requests" [08:18:40] and if pybal is watching the fastcgi server, it'll detect that [08:18:45] <_joe_> yes [08:19:18] i dunno about you, but i think this is a pretty nice solution we've arrived at [08:19:22] <_joe_> yes, what I meant was that I've seen hhvm get in a state where it accepts connections, but spits errors [08:19:41] <_joe_> ori: I just need to check that it works, but it looks nice [08:20:09] if it's accepting connections, it hasn't crashed, which means that the error was recoverable, which means that other connections aren't affected, which means the server shouldn't be depooled [08:20:33] RECOVERY - NTP on wtp1023 is OK: NTP OK: Offset -0.008044838905 secs [08:21:09] okay, it's half past midnight, i hate the DOMDocument extension code with the intensity of a thousand burning suns [08:21:13] <_joe_> but we don't have to worry about weird crashes - the usual rendering checks will get those in due time [08:21:16] <_joe_> ahah [08:21:23] <_joe_> go have some rest [08:21:42] apologies again if i was cranky, i'm really grateful that you're thinking through all this [08:21:46] and good night [08:21:46] <_joe_> I will try to see if anything stands in the way of using idleconnection [08:21:52] <_joe_> and the rest [08:22:03] <_joe_> night! [08:44:19] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [08:44:19] PROBLEM - puppet last run on wtp1016 is CRITICAL: CRITICAL: Puppet has 2 failures [08:45:09] RECOVERY - puppet last run on wtp1016 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [08:58:02] PROBLEM - NTP on wtp1016 is CRITICAL: NTP CRITICAL: Offset unknown [08:59:03] RECOVERY - NTP on wtp1016 is OK: NTP OK: Offset -0.006876468658 secs [09:02:03] PROBLEM - Host wtp1013 is DOWN: PING CRITICAL - Packet loss = 100% [09:02:28] RECOVERY - Host wtp1013 is UP: PING OK - Packet loss = 0%, RTA = 3.53 ms [09:09:39] !log repool wtp1013,wtp1014,wtp1015,wtp1016,wtp1017 [09:09:47] Logged the message, Master [09:11:47] !log depool wtp1002, wtp1007-wtp1012 [09:11:56] Logged the message, Master [09:13:39] PROBLEM - ElasticSearch health check on elastic1029 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6169: relocating_shards: 4: initializing_shards: 1: unassigned_shards: 1 [09:13:40] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6169: relocating_shards: 4: initializing_shards: 1: unassigned_shards: 1 [09:13:40] PROBLEM - ElasticSearch health check on elastic1021 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6169: relocating_shards: 4: initializing_shards: 1: unassigned_shards: 1 [09:13:40] PROBLEM - ElasticSearch health check on elastic1023 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6169: relocating_shards: 4: initializing_shards: 1: unassigned_shards: 1 [09:13:40] PROBLEM - ElasticSearch health check on elastic1012 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2051: active_shards: 6169: relocating_shards: 4: initializing_shards: 1: unassigned_shards: 1 [09:14:39] RECOVERY - ElasticSearch health check on elastic1023 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2052: active_shards: 6172: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [09:14:39] RECOVERY - ElasticSearch health check on elastic1021 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2052: active_shards: 6172: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [09:14:39] RECOVERY - ElasticSearch health check on elastic1012 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2052: active_shards: 6172: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [09:14:39] RECOVERY - ElasticSearch health check on elastic1029 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2052: active_shards: 6172: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [09:14:39] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2052: active_shards: 6172: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [09:15:09] status red ? for 1 unassigned shard and one initializing shard ? ... [09:16:18] mh, there's an additional check that looks at % instead of raw numbers, I think we never disabled the old one tho [09:16:27] (03PS1) 10Giuseppe Lavagetto: Fix the behaviour of the hhvm restart mechanism. 
[debs/hhvm] - 10https://gerrit.wikimedia.org/r/171217 [09:17:43] (03PS2) 10Faidon Liambotis: Fix the behaviour of the hhvm restart mechanism [debs/hhvm] - 10https://gerrit.wikimedia.org/r/171217 (owner: 10Giuseppe Lavagetto) [09:17:52] sorry, the OCD in me [09:17:58] godog: do you know which servers handle image scaling for beta, and whether the CPU usage for those is graphed somewhere? [09:18:50] <_joe_> ahahah thanks [09:23:43] (03PS1) 10Glaisher: Delete vewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171219 (https://bugzilla.wikimedia.org/55737) [09:24:48] PROBLEM - check if salt-minion is running on wtp1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [09:24:49] PROBLEM - check if salt-minion is running on wtp1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [09:24:58] PROBLEM - check if salt-minion is running on wtp1007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [09:25:21] PROBLEM - check if salt-minion is running on wtp1011 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [09:25:21] PROBLEM - check if salt-minion is running on wtp1012 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [09:25:21] PROBLEM - check if salt-minion is running on wtp1009 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [09:25:22] gi11es: nope I don't know offhand which machine it is, resource usage should be available though perhaps graphite.wmflabs.org [09:27:28] RECOVERY - check if salt-minion is running on wtp1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:31:10] RECOVERY - check if salt-minion is running on wtp1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:31:49] RECOVERY - check if salt-minion is running on wtp1011 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:32:28] RECOVERY - check if salt-minion is running on wtp1007 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:35:29] RECOVERY - check if salt-minion is running on wtp1008 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:38:25] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Fix the behaviour of the hhvm restart mechanism [debs/hhvm] - 10https://gerrit.wikimedia.org/r/171217 (owner: 10Giuseppe Lavagetto) [09:41:01] (03CR) 10Faidon Liambotis: [C: 04-1] "What's wrong with dickson? RT?" [dns] - 10https://gerrit.wikimedia.org/r/115093 (owner: 10coren) [09:42:57] RECOVERY - check if salt-minion is running on wtp1009 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:44:56] (03CR) 10Glaisher: "Delete vewikimedia: I345516cd005b2716cd11925d528a182503910f0e" [puppet] - 10https://gerrit.wikimedia.org/r/170925 (owner: 10Glaisher) [09:48:46] _joe_: heh, looks like you missed one when moving monitor_ to monitoring::* [09:48:49] * YuviPanda makes patch [09:50:13] (03PS1) 10Yuvipanda: contint: Fix grpahite monitoring to use monitoring::* [puppet] - 10https://gerrit.wikimedia.org/r/171222 [09:50:38] (03PS2) 10Yuvipanda: contint: Fix graphite monitoring to use monitoring::* [puppet] - 10https://gerrit.wikimedia.org/r/171222 [09:50:44] <_joe_> ach, sorry [09:51:08] _joe_: ^ +1? 
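A rough sketch, in Python, of the /stop-based restart flow _joe_ and ori discussed above (the debs/hhvm change just merged addresses the /stop behaviour itself): ask the running instance to shut down through its admin HTTP server rather than signalling it, wait briefly for the FastCGI port to be released, then start the replacement. The ports, config path and command line below are placeholders, not production values, and _joe_'s variant inverts the ordering so the new instance starts first and spins on bind(), taking over the moment the old one lets go.

    import socket
    import subprocess
    import time
    import urllib.request

    FCGI_PORT = 9000                      # placeholder FastCGI port
    ADMIN_URL = "http://127.0.0.1:9001"   # placeholder admin port (the AdminServer.HttpPort knob mentioned above)

    def fcgi_port_free():
        with socket.socket() as s:
            return s.connect_ex(("127.0.0.1", FCGI_PORT)) != 0

    def restart_hhvm(timeout=30):
        # Ask the running instance to drain and exit via the admin /stop endpoint.
        urllib.request.urlopen(ADMIN_URL + "/stop", timeout=5).read()

        # Wait (bounded) for it to finish in-flight requests and release the port.
        deadline = time.time() + timeout
        while not fcgi_port_free() and time.time() < deadline:
            time.sleep(0.2)

        # Start the replacement instance; command line is illustrative only.
        return subprocess.Popen(["hhvm", "--mode", "daemon", "-c", "/etc/hhvm/server.ini"])

    if __name__ == "__main__":
        restart_hhvm()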
[09:51:28] (03CR) 10Giuseppe Lavagetto: [C: 031] contint: Fix graphite monitoring to use monitoring::* [puppet] - 10https://gerrit.wikimedia.org/r/171222 (owner: 10Yuvipanda) [09:51:39] <_joe_> how did I miss that? [09:51:43] (03CR) 10Yuvipanda: [C: 032] contint: Fix graphite monitoring to use monitoring::* [puppet] - 10https://gerrit.wikimedia.org/r/171222 (owner: 10Yuvipanda) [09:55:17] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:05:00] is my DNS fucked up, or can others also not resolve tools-login.wmflabs.org? [10:05:19] hmm [10:05:29] intermittent, is fine now [10:47:47] <_joe_> !log installed hhvm 3.3.0-20140925+wmf4 on osmium for testing. [10:47:55] Logged the message, Master [11:24:33] PROBLEM - puppet last run on ms-be2003 is CRITICAL: CRITICAL: puppet fail [11:34:05] ^ 502 from the master, I guess it'll work on next try [11:42:44] RECOVERY - puppet last run on ms-be2003 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [11:44:05] (03PS1) 10Yuvipanda: tools: Add python-gdal package [puppet] - 10https://gerrit.wikimedia.org/r/171236 [11:44:26] (03CR) 10Yuvipanda: [C: 032] tools: Add python-gdal package [puppet] - 10https://gerrit.wikimedia.org/r/171236 (owner: 10Yuvipanda) [12:40:56] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [12:52:41] (03PS1) 10Giuseppe Lavagetto: hhvm: remove unnecessary upstart stanza, config option [puppet] - 10https://gerrit.wikimedia.org/r/171244 [12:53:02] <_joe_> ^^ one should never ever take what docs say for granted :/ [12:54:47] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [12:57:03] (03CR) 10Giuseppe Lavagetto: "For a reference in the code:" [puppet] - 10https://gerrit.wikimedia.org/r/171244 (owner: 10Giuseppe Lavagetto) [13:53:20] PROBLEM - check if salt-minion is running on wtp1011 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [13:53:46] PROBLEM - check if salt-minion is running on wtp1008 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [13:53:46] PROBLEM - check if salt-minion is running on wtp1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [13:53:57] PROBLEM - check if salt-minion is running on wtp1007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [13:54:11] PROBLEM - check if salt-minion is running on wtp1009 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion [13:56:23] <_joe_> this is like the 'hey, I'm alive!" greeting from our newly reimaged servers [13:57:16] interesting though, would the puppet check fire in a while too? [13:57:24] as in "puppet last run" [13:58:24] niah, not yet... [13:58:46] also... I just finished up upgrading the wtp cluster so that ruby can have a new security vuln [14:01:36] <_joe_> lol [14:02:22] <_joe_> in the meantime, I recently discovered most of the work I did for the last two days was basically useless :) [14:02:37] yay! [14:02:42] why ? 
[14:09:38] <_joe_> akosiaris: because the graceful restart model that "almost works" in hhvm needs for a second instance of hhvm to be started in parallel, and to let it takeover the old one [14:10:12] <_joe_> and I don't think it's possible to make upstart do that, given it wants to work with signals to the running process instead [14:10:45] so_reuseaddr ? [14:10:52] <_joe_> and HHVM's reaction to any signal is AFAICT, "OMFG LETS'DIE" [14:11:10] so it starts a completely new process [14:11:27] <_joe_> upstart does that for you, yes [14:12:09] <_joe_> when you issue reload, it sends by default HUP (if I recall correctly). This kills hhvm, then upstart respawns it. [14:12:17] just sent an email about the other fundamental way to solve the problem, draining [14:12:29] <_joe_> mark: thanks :) [14:12:33] _joe_: yeah, but that is not graceful [14:12:42] so no go.. [14:12:44] i suppose my old dbus idea might be a bit heavy [14:12:52] let's use systemd ... [14:12:56] <_joe_> akosiaris: there is no graceful mechanism in HHVM. [14:12:56] and socket activation :P [14:12:58] but perhaps it's not very hard to make a pybal health check which is effectively something listening to an event [14:13:05] and then have upstart or whatever contact pybal [14:13:12] <_joe_> akosiaris: I was reading http://0pointer.de/blog/projects/socket-activated-containers.html right now [14:13:18] on restart [14:13:37] <_joe_> mark: ori proposed to use an idleconnection monitor on the fastcgi port [14:13:54] _joe_: well... not sure I want to go both systemd and HHVM at the same time... [14:13:58] <_joe_> that would work I guess [14:14:07] <_joe_> akosiaris: nope, but I got interested :) [14:14:11] it would be an improvement [14:14:16] I do think it would work though [14:14:22] but of course draining _before_ anything happens to any port would be best [14:14:30] well kind of... [14:14:32] <_joe_> mark: I agree [14:14:40] idleconnection is just "know very fast" [14:14:44] but not always fast enough ;) [14:14:49] it was a nice and simple idea at the time [14:14:58] which was essentially what prompted me to write pybal [14:16:24] <_joe_> and anyways, I bet the takeover mechanism works perfectly with the http server that FB uses internally - it is completely broken in the FCGI implementation [14:16:44] as i said in the email, ideally we solve both fundamental problems [14:19:39] sorry I think I might have missed this, but assuming we could export hhvm healthchecks what's wrong with start failing the healthchecks for draining? (i.e. weight a server to 0) [14:20:10] vs removing the server altogether [14:20:39] <_joe_> godog: that was the idea I guess [14:21:01] <_joe_> mark: the only hard part in your idea is "we could even figure out a way where PyBal could slightly delay such an event, and only allow a stop/restart to progress after it has acknowledged depooling said server. 
[14:21:15] don't think so [14:21:19] <_joe_> at least to do that in a general way [14:21:26] if upstart pre-stop calls a simple pybal helper program [14:21:33] and that program contacts a pybal server port [14:21:36] <_joe_> right [14:21:37] "hey i'm going down" [14:21:46] at which point pybal depools and says "go ahead" [14:21:48] <_joe_> and it waits for pybal ack [14:21:50] <_joe_> right [14:21:51] all with a short timeout, max a few seconds [14:21:51] yep [14:22:15] <_joe_> yes this seems like a good general idea [14:23:09] i was already sending this to the list, sent now [14:23:24] i kinda want to jfdi [14:23:28] but I have no time :( [14:23:51] <_joe_> mark: I can finally work on pybal, now I have an excuse :) [14:24:06] <_joe_> but I guess this afternoon I should work on paging as well [14:24:18] i'm happy to review any code at least [14:25:20] I don't like these ideas much tbh [14:25:33] why not? [14:25:47] _joe_: sure, I was thinking of something simpler without talking to pybal but just failing the healthcheck saying "please drain" [14:26:00] creating an orchestration method between appservers and loadbalancers to handle a problem that can be dealt locally [14:26:05] <_joe_> godog: that was my original proposal [14:26:12] I think it's fundamentally more complex, even if we make it work right [14:26:14] paravoid: it needs to be handled for every server again [14:26:35] it's nice to live in utopia [14:26:37] but we're in the real world here [14:26:46] I don't think graceful restarts are utopia [14:26:59] they're not, and we should solve both, as I said [14:27:16] <_joe_> they're not; in HHVM as it is now, they are. [14:27:17] but thinking we can have graceful restarts for every service we want to load balance, that is utopia [14:27:26] adding orchestration capabilities to pybal is a good idea and one that I've raised in the past myself [14:28:03] <_joe_> paravoid: this is hardly orchestration btw [14:28:06] but relying on such a system to perform the simplest of tasks like restarting a service? can't say it thrills me as an idea [14:28:33] too bad then :) [14:28:38] are we also going to patch check_http so that we don't get alerts during restarts? [14:28:39] <_joe_> paravoid: I have one more argument that's HHVM specific [14:29:08] i don't know what check_http has to do with this [14:29:35] well, when the service is restarted right now, we get alerts [14:29:40] <_joe_> HHVM takes quite a performance hit for the first N requests. Facebook does warm it up before putting it back in rotation, which doesn't seem a bad idea either [14:29:52] <_joe_> paravoid: only during outages [14:30:08] what do you mean outages? 
there's a window where the service _is_ unresponsive [14:30:09] <_joe_> I have restarted repeatedly hhvm and it is up and running in a few seconds [14:30:10] check_http does multiple tries [14:30:19] <_joe_> yes, a few seconds [14:30:29] <_joe_> we would never get paged for that [14:30:46] I never said paged [14:31:01] check_http doesn't do multiple tries, icinga does (I said that too the other day) [14:31:04] anyway, feel free to solve the HHVM problem while we work on the other one [14:31:22] but you still get a page full of yellow in icinga's unhandled problems page [14:31:28] (which apparently I'm the only one looking :) [14:44:26] PROBLEM - Host wtp1012 is DOWN: PING CRITICAL - Packet loss = 100% [14:44:26] PROBLEM - Host wtp1010 is DOWN: PING CRITICAL - Packet loss = 100% [14:48:04] well, so part of this problem that's not really being mentioned is that LVS is a little different than e.g. the backend of a reverse proxy like varnish or nginx [14:48:21] LVS just routes packets, it doesn't have a concept of retrying client connections [14:48:57] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [14:48:57] if the path from varnish -> HHVM acted more like a reverse proxy, it could patch over the availability window on one machine by retrying the client's request on another that's not just been marked as temporarily failing. [14:49:37] RECOVERY - Host wtp1010 is UP: PING OK - Packet loss = 0%, RTA = 0.98 ms [14:49:37] RECOVERY - Host wtp1012 is UP: PING OK - Packet loss = 0%, RTA = 1.26 ms [14:49:53] (which makes me wonder what the rationale is for not just having the backends listed directly in varnish instead of going through an LVS layer there) [14:49:59] mark had experimented with that [14:50:01] with bits [14:50:02] I mean I'm sure there is one, but is it worth it? 
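A minimal sketch of the depool handshake mark outlines above: a tiny helper, run from the service's pre-stop hook, tells pybal the host is about to go down and waits a few seconds for an acknowledgement that it has been depooled. The port, wire format and "depool"/"ok" tokens are invented for illustration; pybal exposes no such interface today, which is the point of the proposal.

    import socket
    import sys

    PYBAL_HOST = "lvs-balancer.example"   # placeholder
    PYBAL_PORT = 9090                     # placeholder control port

    def announce_shutdown(hostname, timeout=3.0):
        try:
            with socket.create_connection((PYBAL_HOST, PYBAL_PORT), timeout=timeout) as s:
                s.settimeout(timeout)
                s.sendall(("depool %s\n" % hostname).encode())
                return s.recv(64).decode().strip() == "ok"
        except OSError:
            # If pybal is unreachable, give up quickly rather than block the restart.
            return False

    if __name__ == "__main__":
        # e.g. called from a "pre-stop exec /usr/local/bin/depool-me" stanza in the job
        if not announce_shutdown(socket.getfqdn()):
            sys.stderr.write("warning: proceeding without depool acknowledgement\n")
        sys.exit(0)  # never fail the stop itself; the timeout bounds the delay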
[14:50:53] the only real reason is that we're not really setup for it [14:51:06] we don't have any good way of pooling/depooling with varnish configs at the moment at all [14:51:07] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 5 below the confidence bounds [14:51:09] if we would fix that, we could do that [14:51:57] PROBLEM - Parsoid on wtp1010 is CRITICAL: Connection refused [14:51:58] PROBLEM - check if dhclient is running on wtp1012 is CRITICAL: Connection refused by host [14:52:06] PROBLEM - DPKG on wtp1010 is CRITICAL: Connection refused by host [14:52:06] PROBLEM - check if salt-minion is running on wtp1012 is CRITICAL: Connection refused by host [14:52:07] PROBLEM - Disk space on wtp1012 is CRITICAL: Connection refused by host [14:52:16] PROBLEM - parsoid disk space on wtp1012 is CRITICAL: Connection refused by host [14:52:17] PROBLEM - check configured eth on wtp1010 is CRITICAL: Connection refused by host [14:52:26] PROBLEM - check if dhclient is running on wtp1010 is CRITICAL: Connection refused by host [14:52:32] PROBLEM - SSH on wtp1010 is CRITICAL: Connection refused [14:52:32] PROBLEM - Parsoid on wtp1012 is CRITICAL: Connection refused [14:52:32] PROBLEM - puppet last run on wtp1012 is CRITICAL: Connection refused by host [14:52:46] PROBLEM - SSH on wtp1012 is CRITICAL: Connection refused [14:52:47] PROBLEM - check if salt-minion is running on wtp1010 is CRITICAL: Connection refused by host [14:52:47] PROBLEM - RAID on wtp1012 is CRITICAL: Connection refused by host [14:52:47] PROBLEM - parsoid disk space on wtp1010 is CRITICAL: Connection refused by host [14:52:47] PROBLEM - puppet last run on wtp1010 is CRITICAL: Connection refused by host [14:52:47] PROBLEM - check configured eth on wtp1012 is CRITICAL: Connection refused by host [14:52:47] PROBLEM - RAID on wtp1010 is CRITICAL: Connection refused by host [14:52:56] PROBLEM - Disk space on wtp1010 is CRITICAL: Connection refused by host [14:52:57] PROBLEM - DPKG on wtp1012 is CRITICAL: Connection refused by host [14:53:02] (03PS1) 10Faidon Liambotis: Kill all (outdated) references to pmtpa [dns] - 10https://gerrit.wikimedia.org/r/171265 [14:54:02] (03PS1) 10Faidon Liambotis: Allocate frack-codfw [dns] - 10https://gerrit.wikimedia.org/r/171266 [14:56:17] (03PS1) 10QChris: Make varnishkafka pick up Range header [puppet] - 10https://gerrit.wikimedia.org/r/171268 (https://bugzilla.wikimedia.org/73021) [14:57:19] (03CR) 10QChris: [C: 04-1] "This change depends on" [puppet] - 10https://gerrit.wikimedia.org/r/171268 (https://bugzilla.wikimedia.org/73021) (owner: 10QChris) [14:57:57] RECOVERY - SSH on wtp1010 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [14:58:07] RECOVERY - SSH on wtp1012 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [15:05:58] PROBLEM - NTP on wtp1012 is CRITICAL: NTP CRITICAL: No response from NTP server [15:06:06] PROBLEM - NTP on wtp1010 is CRITICAL: NTP CRITICAL: No response from NTP server [15:16:26] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [15:21:24] (03CR) 10Mark Bergsma: [C: 031] Kill all (outdated) references to pmtpa [dns] - 10https://gerrit.wikimedia.org/r/171265 (owner: 10Faidon Liambotis) [15:22:27] (03CR) 10Mark Bergsma: [C: 031] Allocate frack-codfw [dns] - 10https://gerrit.wikimedia.org/r/171266 (owner: 10Faidon Liambotis) [15:26:17] (03PS1) 10KartikMistry: Beta: Enable EventLogging in ContentTranslation [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/171273 [15:31:33] Is there any example where DB is created/upgraded by Puppet for extension (requires separate DB) [15:36:47] RECOVERY - check configured eth on wtp1010 is OK: NRPE: Unable to read output [15:36:57] RECOVERY - check if dhclient is running on wtp1010 is OK: PROCS OK: 0 processes with command name dhclient [15:37:09] RECOVERY - check if salt-minion is running on wtp1010 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:37:16] RECOVERY - Disk space on wtp1010 is OK: DISK OK [15:37:16] RECOVERY - DPKG on wtp1012 is OK: All packages OK [15:37:17] RECOVERY - RAID on wtp1012 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [15:37:26] RECOVERY - check if dhclient is running on wtp1012 is OK: PROCS OK: 0 processes with command name dhclient [15:37:27] RECOVERY - parsoid disk space on wtp1010 is OK: DISK OK [15:37:27] RECOVERY - DPKG on wtp1010 is OK: All packages OK [15:37:27] RECOVERY - check configured eth on wtp1012 is OK: NRPE: Unable to read output [15:37:27] RECOVERY - RAID on wtp1010 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [15:37:36] RECOVERY - Disk space on wtp1012 is OK: DISK OK [15:37:56] RECOVERY - parsoid disk space on wtp1012 is OK: DISK OK [15:40:10] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2062: active_shards: 6202: relocating_shards: 3: initializing_shards: 1: unassigned_shards: 1 [15:43:11] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2062: active_shards: 6202: relocating_shards: 5: initializing_shards: 0: unassigned_shards: 0 [15:43:31] PROBLEM - Parsoid on wtp1008 is CRITICAL: Connection refused [15:43:31] PROBLEM - Parsoid on wtp1011 is CRITICAL: Connection refused [15:43:41] PROBLEM - Parsoid on wtp1007 is CRITICAL: Connection refused [15:43:41] PROBLEM - ElasticSearch health check on elastic1002 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2062: active_shards: 6202: relocating_shards: 5: initializing_shards: 1: unassigned_shards: 1 [15:44:34] (03CR) 10Santhosh: [C: 031] Beta: Enable EventLogging in ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171273 (owner: 10KartikMistry) [15:44:40] PROBLEM - puppet last run on wtp1009 is CRITICAL: CRITICAL: Puppet has 1 failures [15:44:50] RECOVERY - ElasticSearch health check on elastic1002 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 30: number_of_data_nodes: 30: active_primary_shards: 2065: active_shards: 6211: relocating_shards: 4: initializing_shards: 0: unassigned_shards: 0 [15:44:51] PROBLEM - puppet last run on wtp1008 is CRITICAL: CRITICAL: Puppet has 2 failures [15:44:51] PROBLEM - puppet last run on wtp1011 is CRITICAL: CRITICAL: Puppet has 2 failures [15:45:12] PROBLEM - puppet last run on wtp1007 is CRITICAL: CRITICAL: Puppet has 2 failures [15:45:41] PROBLEM - Parsoid on wtp1009 is CRITICAL: Connection refused [15:47:42] RECOVERY - puppet last run on wtp1012 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [15:48:23] Who's SWATting? [15:48:38] I guess ^d [15:48:43] Jesus look at them patches [15:50:28] * anomie sees nothing for SWAT this morning [15:53:08] <^d|voted> i swatted last night. [15:53:12] <^d|voted> calendar is confusing now. [15:54:20] RECOVERY - NTP on wtp1012 is OK: NTP OK: Offset -0.006633758545 secs [15:54:40] RECOVERY - NTP on wtp1010 is OK: NTP OK: Offset -0.004110217094 secs [15:54:46] (03PS1) 10Chad: frwiki gets Cirrus as default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171276 [15:55:44] RECOVERY - puppet last run on wtp1008 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [15:56:40] RECOVERY - Parsoid on wtp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.009 second response time [15:58:06] why are there no mobile sites for chapter wikis? [15:59:23] <^d> nobody in the chapters has a cell phone? [15:59:56] many do [16:00:04] manybubbles, anomie, ^d, marktraceur: Dear anthropoid, the time has come. Please deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141105T1600). [16:00:07] HALP! [16:00:19] Ohhh. [16:00:21] <^d> jouncebot: go away jouncebot, no swat [16:00:51] Right, now that DST is over, the last SWAT of the day is the first SWAT of the day. [16:02:10] RECOVERY - Parsoid on wtp1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.025 second response time [16:02:24] RECOVERY - puppet last run on wtp1009 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [16:07:40] (03PS8) 10Krinkle: [WIP] contint: Apply contint::qunit_localhost to labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/168631 [16:11:51] RECOVERY - puppet last run on wtp1011 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [16:12:31] RECOVERY - Parsoid on wtp1011 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.009 second response time [16:12:45] (03PS9) 10Krinkle: [WIP] contint: Apply contint::qunit_localhost to labs slaves [puppet] - 10https://gerrit.wikimedia.org/r/168631 (https://bugzilla.wikimedia.org/72063) [16:13:02] RECOVERY - Parsoid on wtp1007 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.034 second response time [16:13:20] RECOVERY - puppet last run on wtp1007 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [16:13:44] Now SWATs are described as "Morning SWAT" and "Evening SWAT" in the calendar [16:14:00] Hopefully greg-g is still copying the last week for a template, and my changes are sufficient [16:15:11] RECOVERY - Parsoid on wtp1010 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.009 second response time [16:15:20] RECOVERY - puppet last run on wtp1010 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [16:24:09] marktraceur: Such timezone favoritism. 
00:00–01:00 UTC is obviously morning somewhere and 16:00–17:00 UTC is hardly morning in the same location. :) [16:24:30] bd808: They're names [16:24:42] bd808: We could also call them Florglenut SWAT and Skiddlyboop SWAT [16:24:45] If you'd prefer [16:25:19] But the deployments calendar is already time-racist [16:25:26] "that one time" swat and "oops it's time again" swat [16:28:08] marktraceur: yep [16:28:23] Cool beans [16:28:52] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [16:30:56] (03CR) 10Nuria: [C: 031] varnish: allow POST for EventLogging on bits [puppet] - 10https://gerrit.wikimedia.org/r/170883 (owner: 10Faidon Liambotis) [16:34:18] (03CR) 10Manybubbles: [C: 031] frwiki gets Cirrus as default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171276 (owner: 10Chad) [16:37:55] what happened to elastic1022? [16:44:15] PROBLEM - Host wtp1009 is DOWN: PING CRITICAL - Packet loss = 100% [16:44:15] PROBLEM - Host wtp1002 is DOWN: PING CRITICAL - Packet loss = 100% [16:44:26] PROBLEM - Host wtp1011 is DOWN: PING CRITICAL - Packet loss = 100% [16:44:31] PROBLEM - Host wtp1008 is DOWN: PING CRITICAL - Packet loss = 100% [16:44:34] PROBLEM - Host wtp1007 is DOWN: PING CRITICAL - Packet loss = 100% [16:44:55] RECOVERY - Host wtp1011 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms [16:44:55] RECOVERY - Host wtp1008 is UP: PING OK - Packet loss = 0%, RTA = 1.20 ms [16:44:55] RECOVERY - Host wtp1009 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [16:44:55] RECOVERY - Host wtp1007 is UP: PING OK - Packet loss = 0%, RTA = 1.10 ms [16:44:55] RECOVERY - Host wtp1002 is UP: PING OK - Packet loss = 0%, RTA = 1.70 ms [16:46:15] <^d> manybubbles: Networking cable loose possibly? [16:46:26] <^d> We tried kicking it last night but could only reach from mgmt. [16:46:52] <^d> 01:11 mutante: elatic1022 - eth0: PROBLEM - Apache HTTP on mw1031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:29] PROBLEM - HHVM rendering on mw1031 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:51:38] <^d> manybubbles: ~10 minutes until go time. let's check in. any last reasons to abort frwiki? [16:52:06] <^d> (frwiki has already rebuilt overnight, so it's all good to go there too :)) [16:52:20] ^d: nope! [16:52:32] hi operations... where are production varnish config files..? (thanks in advance) [16:52:49] RECOVERY - check if salt-minion is running on wtp1012 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [16:53:48] RECOVERY - Parsoid on wtp1012 is OK: HTTP OK: HTTP/1.1 200 OK - 1108 bytes in 0.018 second response time [16:55:07] I was walking in the countryside when I came upon a cow. Since it didn't have a name, I called it "cow". [16:55:17] !repool wtp1002, wtp1007-1012 [16:56:16] Where, oh where, are the varnish config files? Oh where, oh where can they be...? ♬ [16:56:41] <^d> puppet? [16:56:49] <^d> :) [16:57:09] Hmm OK lemme loooook [16:57:10] AndyRussG: operations/puppet and git grep is your friend [16:57:20] mostly modules/varnish though [16:58:13] cool thanks much ^d and akosiaris.... I also see an operations/puppet/varnish repo... [16:59:13] it was an effort to use submodules for varnish IIRC, but it endup complicating things so we undid the submodule thing but the repo is still there [16:59:31] <^d> akosiaris: repos can be deleted :) [16:59:38] ah interesting... [16:59:40] oh ? 
:P [17:00:02] <^d> akosiaris: we have a plugin for it :) [17:00:04] manybubbles, ^d: Respected human, time to deploy Search (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141105T1700). Please do the needful. [17:00:18] yeah, I know. I already used it a couple of times [17:00:25] <^d> yay search time. buckle up and hang on to your pants folks :D [17:00:49] Thanks again, have fun [17:00:54] <^d> manybubbles: you ready? [17:00:56] which begs the question.. does phabricator/diffusion have a way to delete a project/repo ? [17:01:00] sure [17:01:11] <^d> akosiaris: Yes. [17:01:19] (03CR) 10Chad: [C: 032] frwiki gets Cirrus as default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171276 (owner: 10Chad) [17:01:27] (03Merged) 10jenkins-bot: frwiki gets Cirrus as default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171276 (owner: 10Chad) [17:02:08] !log demon Synchronized wmf-config/InitialiseSettings.php: frwiki getting cirrusy search (duration: 00m 05s) [17:02:17] Logged the message, Master [17:03:09] ^d: good to know. Thanks [17:03:35] <^d> manybubbles: I see traffic. [17:04:45] looks to be working [17:05:30] <^d> Hard to see the traffic impact on cpu, etc since we're reindexing, but looks to be minimal. [17:05:31] still basically no io load which is what we want [17:05:37] yeah [17:05:40] <^d> indeed [17:05:47] cpu bump isn't really a big deal [17:05:53] its io I think we should be watching [17:06:17] very little difference in load average which is perfect [17:06:19] <^d> I'm just making sure it doesn't skyrocket like it did the first time we tried to switch almost everyone over :p [17:07:11] cool [17:07:50] akosiaris: is parsoid all trusty now? [17:07:53] I've finished the bulk of that java work for the regex stuff. I'm going to integrate it with cirrus now. then cut releases this afternoon. [17:08:03] looks like elasticsearch 1.4.0 was released a whilie ago [17:08:10] <^d> Oh and we missed it? [17:08:11] we probably ought to think about that in a few weeks [17:08:11] <^d> :) [17:08:34] <^d> CirrusSearch-NamespaceLookup is hitting pool counter limits. Not often, but it's popping up. [17:08:43] <^d> I wonder if that even needs poolcountering. [17:09:26] ^d: probably should just increase it [17:09:43] better to have a limit than none [17:09:50] I think we set it pretty low by default [17:10:05] <^d> 50 workers, 100 queue. [17:10:09] <^d> i'm going to double queue to 200. [17:10:26] k [17:10:33] I didn't expect _that_ many hits to it [17:10:45] might be an error. [17:10:52] probably not a huge one, but an error [17:11:09] (03PS1) 10Chad: Increase CirrusSearch-NamespaceLookup poolcounter queue to 100 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171286 [17:11:12] akosiaris: nm, 100{1,3-6} still look like node 0.8 [17:11:26] its reasonably frequent actually [17:11:27] (03PS2) 10Chad: Increase CirrusSearch-NamespaceLookup poolcounter queue to 200 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171286 [17:11:35] (03CR) 10jenkins-bot: [V: 04-1] Increase CirrusSearch-NamespaceLookup poolcounter queue to 200 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171286 (owner: 10Chad) [17:11:47] <^d> crap. 
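The stanza being tuned here lives in wmf-config/PoolCounterSettings-eqiad.php (synced a few lines further down). A minimal sketch of what such an entry might look like, assuming the standard $wgPoolCounterConf keys; the timeout value is invented, while the worker and queue numbers are the ones quoted in the conversation (50 workers, queue doubled from 100 to 200):

    // Sketch only, not the literal wmf-config change.
    $wgPoolCounterConf['CirrusSearch-NamespaceLookup'] = array(
        'class'    => 'PoolCounter_Client',
        'timeout'  => 5,      // assumed value, not stated in the log
        'workers'  => 50,     // concurrent namespace lookups allowed through
        'maxqueue' => 200,    // was 100; callers beyond this are rejected outright
    );

Raising maxqueue rather than workers lets more requests wait their turn without adding concurrent load on the Elasticsearch cluster.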
[17:11:52] <^d> that's what I get for using commit -a [17:12:21] (03PS3) 10Chad: Increase CirrusSearch-NamespaceLookup poolcounter queue to 200 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171286 [17:13:25] (03CR) 10Manybubbles: [C: 032] Increase CirrusSearch-NamespaceLookup poolcounter queue to 200 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171286 (owner: 10Chad) [17:13:32] (03CR) 10Manybubbles: [C: 031] Increase CirrusSearch-NamespaceLookup poolcounter queue to 200 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171286 (owner: 10Chad) [17:14:00] (03CR) 10Chad: [C: 032] Increase CirrusSearch-NamespaceLookup poolcounter queue to 200 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171286 (owner: 10Chad) [17:14:07] (03Merged) 10jenkins-bot: Increase CirrusSearch-NamespaceLookup poolcounter queue to 200 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171286 (owner: 10Chad) [17:14:34] !log demon Synchronized wmf-config/PoolCounterSettings-eqiad.php: (no message) (duration: 00m 06s) [17:14:42] Logged the message, Master [17:15:54] gwicke: I am hoping to be done with that tomorrow [17:16:26] Who deals with Google Webmaster Tools? [17:16:47] <^d> Krenair: Check the blamewheel. Changes every time. [17:17:33] akosiaris: no rush; thanks again for doing this so smoothly! [17:18:37] gwicke: just doing my work :-) [17:19:13] Because someone's reported this happening: https://support.google.com/websearch/answer/190597 for a wikimedia site [17:19:44] I see the same thing [17:23:24] nic [17:23:25] nice [17:26:27] <^d> Krenair: Maybe because someone's been editing all those pages! [17:26:36] <^d> hackers! [17:30:28] They'll be free, ^demon. They'll be free. [17:35:44] marktraceur: ¡Libre soy! http://www.musica.com/letras.asp?letra=2163310 https://www.youtube.com/watch?v=RstJq4kIdp4 [17:38:18] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, won't have time to deploy and check it works this week tho" [puppet] - 10https://gerrit.wikimedia.org/r/171153 (owner: 10Ori.livneh) [17:39:18] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: / 349 MB (3% inode=72%): [17:39:50] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: puppet fail [17:40:57] RECOVERY - Host elastic1022 is UP: PING OK - Packet loss = 0%, RTA = 4.89 ms [17:41:31] manybubbles: not sure how it happened but the network cable was loose on the elastic1022. I must've bumped it yesterday while in the rack and didn't realize [17:41:46] cmjohnson: ah cool. its fine! [17:41:54] I was just afraid it was something bad [17:43:48] PROBLEM - puppet last run on elastic1022 is CRITICAL: CRITICAL: puppet fail [17:46:05] (03PS5) 10Reedy: Remove sync-l10nupdate(-1)? 
[puppet] - 10https://gerrit.wikimedia.org/r/158624 [17:46:54] <^d> cmjohnson: I think mutante filed an RT about it last night too [17:46:57] RECOVERY - puppet last run on elastic1022 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [17:46:57] (03CR) 10Ori.livneh: [C: 031] hhvm: remove unnecessary upstart stanza, config option [puppet] - 10https://gerrit.wikimedia.org/r/171244 (owner: 10Giuseppe Lavagetto) [17:47:11] <^d> cmjohnson: #8811 [17:49:58] PROBLEM - puppet last run on db2007 is CRITICAL: CRITICAL: puppet fail [17:50:04] cool thx ^d [17:54:17] (03CR) 10Filippo Giunchedi: jheapdump: gdb-based heap dump for JVM (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/170996 (owner: 10Filippo Giunchedi) [17:54:43] (03PS2) 10Filippo Giunchedi: jheapdump: gdb-based heap dump for JVM [puppet] - 10https://gerrit.wikimedia.org/r/170996 [17:55:32] (03CR) 10Filippo Giunchedi: "Nik, you are right I've added gdb dep" [puppet] - 10https://gerrit.wikimedia.org/r/170996 (owner: 10Filippo Giunchedi) [17:55:58] PROBLEM - NTP on elastic1022 is CRITICAL: NTP CRITICAL: Offset unknown [17:56:10] bd808: tonythomas What are we doing about Plancake/composer? [17:56:38] PROBLEM - HHVM rendering on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:56:58] PROBLEM - Apache HTTP on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:57:00] just getting things arranged in my head for the branching/deploy today [17:57:11] Reedy: I think I'm ready to give a +2 with the condition that tonythomas keeps trying to figure out if it is abandonware or not. [17:57:18] Thoughts? [17:57:23] That WFM, yeah [17:57:46] It's that, or we just include a specific version in the extension itself [17:57:47] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [17:57:52] * bd808 nods [17:58:06] I saw he opened an issue about releases and you commented on it upstream [17:58:23] Yeah. If it's dead in the wild I have no problem with it being embedded directly in the extension [17:58:51] It's not much work to change it either way at least [18:00:27] Reedy: Do you want me to +2 now? [18:00:35] That'd be good [18:01:41] {{done}} [18:01:44] Thanks [18:01:46] tonythomas: ^^ [18:02:26] <^d> !log elastic1022 unbanned from allocation since it has a network cable again [18:02:33] <^d> manybubbles: ^ [18:02:35] Logged the message, Master [18:02:42] ^d: thanks! [18:03:06] <^d> Yeah I'd banned it last night just in case it started flapping on and off. [18:03:50] ^d: Can you sanity check https://gerrit.wikimedia.org/r/#/c/170025/ please? [18:04:15] <^d> I looked at it. Something feels wrong. [18:04:26] <^d> I can't put my finger on it though. 
[18:04:48] aude: ^^ [18:05:14] (03PS3) 10Ottomata: Add defines for working with mysql config files, and mysql client settings [puppet] - 10https://gerrit.wikimedia.org/r/169722 [18:06:00] (03CR) 10Ottomata: [C: 032] Add defines for working with mysql config files, and mysql client settings [puppet] - 10https://gerrit.wikimedia.org/r/169722 (owner: 10Ottomata) [18:07:44] (03PS1) 10Reedy: Add BounceHandler to extension-list for deploy today [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171294 [18:07:48] RECOVERY - puppet last run on db2007 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [18:08:13] looiing [18:08:15] look* [18:09:21] (03PS3) 10Ori.livneh: base: standardize the path and file name of core dumps [puppet] - 10https://gerrit.wikimedia.org/r/171206 [18:09:29] ^ chasemp, what do you think of that? [18:10:43] I think looks good and why didn't I know about tidy? [18:10:45] good stuff [18:10:56] oh, Reedy, did you see: https://phabricator.wikimedia.org/T1108 [18:10:57] (03PS4) 10Ottomata: Include research mysql user and password in file on stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/168993 [18:10:57] :) [18:11:26] no.. I saw bd808 did mention it and ping me in operations about it though ;) [18:11:43] is there anything in branchedExtensions that needs the $preservedRefs[$name] stuff? [18:11:45] <^d> I left a commentssss [18:11:48] ori: are there any like reserach boxes [18:11:50] ok [18:11:53] that have way more ram than disk? [18:11:57] may not work out [18:12:00] otherwise cool [18:12:14] may want to make this more selective than default [18:12:20] i can try to test it (omitting the git push parts) [18:12:40] (03PS5) 10Ottomata: Include research mysql user and password in file on stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/168993 [18:12:40] chasemp: my first patch was just for hhvm, but then i thought: it's worth it to standardize this across the fleet [18:13:03] ori: I'm not opposed necesarily, I just don't if there are legacy boxes that don't have /var partition [18:13:04] chasemp: it'd be one thing if we had existing conflicting core_patterns, but we don't, so let's do it right before we start having different conventions for different hosts / services, no? [18:13:06] and maybe lots of ram [18:13:09] <^d> aude: Do you see what I'm saying? Tracking a sha1 feels weird to me. [18:13:17] i see [18:13:21] we do that sometimes? [18:13:46] <^d> Well $specialExtensions supports copying the previous branches' state, although we don't use that in practice right now. [18:13:58] ok [18:13:58] something like a stat box may be all cpu and ram little disk [18:14:04] i can poke at it more and try to test [18:14:10] if that thing core dumps and has no /var partition fun times [18:14:30] it's just weird that wikidata special:version says last updated in may :p [18:14:31] chasemp: every host has /var/tmp, i just checked [18:14:43] (03CR) 10Ottomata: [C: 032] Include research mysql user and password in file on stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/168993 (owner: 10Ottomata) [18:14:46] mutante: hey! [18:14:48] where it's on a separate partition? [18:15:00] where's the private puppet repo again? 
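For readers following the ^d/aude exchange about tracking a sha1: this concerns how the weekly wmf deployment branches are cut. A hypothetical sketch of the shape that the $branchedExtensions / $specialExtensions / $preservedRefs talk implies; the names and structure are illustrative, not the actual make-wmf-branch configuration. Most extensions are branched from whatever master currently points to, while a "special" entry can pin one to the state recorded for the previous branch, which is how a bundle like Wikidata can end up showing a months-old commit on Special:Version.

    // Hypothetical sketch, not the real tool config.
    $branchedExtensions = array( 'CirrusSearch', 'VisualEditor', /* ... */ );
    $specialExtensions = array(
        // reuse the ref captured from the previous branch instead of master
        'Wikidata' => $preservedRefs['Wikidata'],
    );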
[18:15:46] (03PS4) 10Ori.livneh: base: standardize the path and file name of core dumps [puppet] - 10https://gerrit.wikimedia.org/r/171206 [18:16:04] chasemp: dunno [18:16:18] me neither :) [18:16:33] but better find out before enable core dumps on them when I don't know RAM/disk ratio's either [18:17:16] <^d> aude: Yeah, I totally understand where you're coming from. And your patch is probably ok...we could just remove support for the "copy previous state" that we don't use if it's a problem. [18:19:28] chasemp: http://everything2.com/title/How+to+get+people+to+clean+up+their+core+dumps "When he went to lunch (without locking his workstation) I edited his .login, changed the default core dump directory to /dev/audio, made sure that the speaker volume was at the maximum and logged him out." [18:20:02] heheh nice [18:20:36] ^d: i'm poking more :) [18:21:44] (03PS1) 10Ottomata: Use mysql::config::client to render research db user and password [puppet] - 10https://gerrit.wikimedia.org/r/171298 [18:24:05] (03CR) 10Ottomata: [C: 032] Use mysql::config::client to render research db user and password [puppet] - 10https://gerrit.wikimedia.org/r/171298 (owner: 10Ottomata) [18:30:54] !log restarting icinga on neon [18:31:02] Logged the message, Master [18:45:57] PROBLEM - RAID on nickel is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:46:56] RECOVERY - RAID on nickel is OK: OK: Active: 3, Working: 3, Failed: 0, Spare: 0 [18:51:30] (03CR) 10Aaron Schulz: "Why not metawiki?" [puppet] - 10https://gerrit.wikimedia.org/r/168622 (owner: 1001tonythomas) [18:51:58] PROBLEM - puppet last run on hooft is CRITICAL: CRITICAL: puppet fail [18:52:24] paravoid: https://gerrit.wikimedia.org/r/127460 [18:52:32] thanks [18:55:23] (03CR) 1001tonythomas: "@Aaron: we had our discussions on chosing meta/en-wiki over here https://bugzilla.wikimedia.org/show_bug.cgi?id=69099#c18" [puppet] - 10https://gerrit.wikimedia.org/r/168622 (owner: 1001tonythomas) [18:55:57] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Puppet last ran 10 hours ago [18:58:16] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [19:00:04] Reedy, greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141105T1900). Please do the needful. [19:05:41] (03PS2) 10Dzahn: toollabs: install p7zip-full on exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/171192 [19:08:17] (03CR) 10coren: [C: 032] "Yes, yes it is." [puppet] - 10https://gerrit.wikimedia.org/r/171192 (owner: 10Dzahn) [19:08:24] (03PS3) 10coren: toollabs: install p7zip-full on exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/171192 (owner: 10Dzahn) [19:09:27] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [19:11:02] (03CR) 10Dzahn: [C: 032] toollabs: install p7zip-full on exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/171192 (owner: 10Dzahn) [19:11:47] RECOVERY - puppet last run on hooft is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [19:12:56] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [19:17:44] !log removed libvips-dev and libvips-tools from our custom repo for Trusty. The default packages seem to work fine. 
[19:17:53] Logged the message, Master [19:21:09] (03CR) 10Dzahn: "yea, uhm, i don't want to create one of these services on each mediawiki appserver in icinga, i just want it a single time, belonging to t" [puppet] - 10https://gerrit.wikimedia.org/r/171193 (owner: 10Dzahn) [19:30:28] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [19:32:17] (03PS2) 10GWicke: WIP: RESTBase puppet module [puppet] - 10https://gerrit.wikimedia.org/r/167213 [19:35:01] (03PS2) 10Reedy: Add BounceHandler to extension-list for deploy today [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171294 [19:35:28] PROBLEM - Disk space on ocg1002 is CRITICAL: DISK CRITICAL - free space: / 349 MB (3% inode=73%): [19:36:05] (03CR) 10Reedy: [C: 032] Add BounceHandler to extension-list for deploy today [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171294 (owner: 10Reedy) [19:36:12] (03Merged) 10jenkins-bot: Add BounceHandler to extension-list for deploy today [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171294 (owner: 10Reedy) [19:36:18] (03CR) 10Faidon Liambotis: [C: 032] Kill all (outdated) references to pmtpa [dns] - 10https://gerrit.wikimedia.org/r/171265 (owner: 10Faidon Liambotis) [19:36:25] (03CR) 10Faidon Liambotis: [C: 032] Allocate frack-codfw [dns] - 10https://gerrit.wikimedia.org/r/171266 (owner: 10Faidon Liambotis) [19:43:20] (03PS1) 10Reedy: Add 1.25wmf7 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171315 [19:43:22] (03PS1) 10Reedy: testwiki to 1.25wmf7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171316 [19:43:24] (03PS1) 10Reedy: Wikipedias to 1.25wmf6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171317 [19:43:26] (03PS1) 10Reedy: group0 to 1.25wmf7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171318 [19:44:06] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [19:44:32] (03CR) 10Reedy: [C: 032] Add 1.25wmf7 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171315 (owner: 10Reedy) [19:44:39] (03CR) 10Reedy: [C: 032] testwiki to 1.25wmf7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171316 (owner: 10Reedy) [19:44:41] (03Merged) 10jenkins-bot: Add 1.25wmf7 symlinks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171315 (owner: 10Reedy) [19:44:47] (03Merged) 10jenkins-bot: testwiki to 1.25wmf7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171316 (owner: 10Reedy) [19:45:31] (03PS1) 10Andrew Bogott: Install libvips, libvips-dev and libvips-tools on Trusty. 
[puppet] - 10https://gerrit.wikimedia.org/r/171319 [19:46:09] (03PS5) 1001tonythomas: Make BounceHandler extension work on meta-wiki [puppet] - 10https://gerrit.wikimedia.org/r/168622 [19:46:15] !log reedy Started scap: testwiki to 1.25wmf7, build l10n cache [19:46:22] Logged the message, Master [19:47:23] !log reedy scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki="fawiki" --list-file="/srv/mediawiki-staging/wmf-config/extension-list" --output="/tmp/tmp.4wTY29z5Gg" ' returned non-zero exit status 1 (duration: 01m 08s) [19:47:31] Logged the message, Master [19:47:38] damn it [19:47:43] !log reedy Started scap: testwiki to 1.25wmf7, build l10n cache [19:47:56] !log reedy scap failed: CalledProcessError Command '/usr/local/bin/mwscript mergeMessageFileList.php --wiki="fawiki" --list-file="/srv/mediawiki-staging/wmf-config/extension-list" --output="/tmp/tmp.37qNnawZ9J" --verbose' returned non-zero exit status 1 (duration: 00m 13s) [19:48:09] oh, wmf5 [19:48:10] GRRR [19:49:14] * Reedy live hacks [19:49:22] !log reedy Started scap: testwiki to 1.25wmf7, build l10n cache [19:49:24] :( [19:50:39] Hmm [19:50:41] (03PS3) 10Tim Landscheidt: geturls: Fix pyflakes warnings [software] - 10https://gerrit.wikimedia.org/r/169253 [19:50:50] I should backport adding Plancake to 1.25wmf6 vendor too [19:51:13] No point attempting to cherry pick though [19:54:38] PROBLEM - ElasticSearch health check on elastic1029 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2057: active_shards: 6187: relocating_shards: 16: initializing_shards: 1: unassigned_shards: 1 [19:56:39] RECOVERY - ElasticSearch health check on elastic1029 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2059: active_shards: 6193: relocating_shards: 14: initializing_shards: 0: unassigned_shards: 0 [20:01:18] PROBLEM - ElasticSearch health check on elastic1025 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2058: active_shards: 6190: relocating_shards: 16: initializing_shards: 1: unassigned_shards: 1 [20:02:18] RECOVERY - ElasticSearch health check on elastic1025 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2059: active_shards: 6193: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [20:12:05] (03PS1) 10Jgreen: add a neutral SPF record for wikipedia.org and other domains of template [dns] - 10https://gerrit.wikimedia.org/r/171324 [20:17:30] (03CR) 10Dzahn: [C: 032] "this class just used on tin, mysql != mysql_wmf != mariadb, and only installs the client anyways" [puppet] - 10https://gerrit.wikimedia.org/r/170486 (owner: 10Dzahn) [20:18:08] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2060: active_shards: 6196: relocating_shards: 16: initializing_shards: 1: unassigned_shards: 1 [20:18:08] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. 
status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2060: active_shards: 6196: relocating_shards: 16: initializing_shards: 1: unassigned_shards: 1 [20:18:08] PROBLEM - ElasticSearch health check on elastic1020 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2060: active_shards: 6196: relocating_shards: 16: initializing_shards: 1: unassigned_shards: 1 [20:18:08] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2060: active_shards: 6196: relocating_shards: 16: initializing_shards: 1: unassigned_shards: 1 [20:18:08] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2060: active_shards: 6196: relocating_shards: 16: initializing_shards: 1: unassigned_shards: 1 [20:18:16] PROBLEM - ElasticSearch health check on elastic1024 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2060: active_shards: 6196: relocating_shards: 16: initializing_shards: 1: unassigned_shards: 1 [20:18:18] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2060: active_shards: 6196: relocating_shards: 16: initializing_shards: 1: unassigned_shards: 1 [20:18:18] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2060: active_shards: 6196: relocating_shards: 16: initializing_shards: 1: unassigned_shards: 1 [20:19:14] manybubbles: ^d ^ [20:19:17] RECOVERY - ElasticSearch health check on elastic1024 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2061: active_shards: 6199: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [20:19:19] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2061: active_shards: 6199: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [20:19:19] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2061: active_shards: 6199: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [20:19:19] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2061: active_shards: 6199: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [20:19:19] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2061: active_shards: 6199: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [20:19:19] RECOVERY - ElasticSearch health check on elastic1020 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2061: active_shards: 6199: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [20:19:20] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2061: active_shards: 6199: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [20:19:20] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2061: active_shards: 6199: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [20:19:26] don't do that [20:19:33] its the rebuilds [20:19:41] the monitoring is jumpy during them [20:20:01] didn't do anything. ok [20:21:29] (03PS1) 10Ottomata: Remove eventlogging logs older than 90 days [puppet] - 10https://gerrit.wikimedia.org/r/171329 [20:22:12] (03CR) 10jenkins-bot: [V: 04-1] Remove eventlogging logs older than 90 days [puppet] - 10https://gerrit.wikimedia.org/r/171329 (owner: 10Ottomata) [20:22:19] (03PS2) 10Ottomata: Remove eventlogging logs older than 90 days [puppet] - 10https://gerrit.wikimedia.org/r/171329 [20:23:00] (03CR) 10jenkins-bot: [V: 04-1] Remove eventlogging logs older than 90 days [puppet] - 10https://gerrit.wikimedia.org/r/171329 (owner: 10Ottomata) [20:23:12] (03PS3) 10Ottomata: Remove eventlogging logs older than 90 days [puppet] - 10https://gerrit.wikimedia.org/r/171329 [20:24:09] (03CR) 10jenkins-bot: [V: 04-1] Remove eventlogging logs older than 90 days [puppet] - 10https://gerrit.wikimedia.org/r/171329 (owner: 10Ottomata) [20:25:55] (03PS4) 10Ottomata: Remove eventlogging logs older than 90 days [puppet] - 10https://gerrit.wikimedia.org/r/171329 [20:26:10] PROBLEM - ElasticSearch health check on elastic1022 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2059: active_shards: 6193: relocating_shards: 16: initializing_shards: 1: unassigned_shards: 1 [20:26:10] PROBLEM - ElasticSearch health check on elastic1030 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2059: active_shards: 6193: relocating_shards: 16: initializing_shards: 1: unassigned_shards: 1 [20:27:07] RECOVERY - ElasticSearch health check on elastic1022 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2059: active_shards: 6193: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [20:27:07] RECOVERY - ElasticSearch health check on elastic1030 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2059: active_shards: 6193: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [20:27:24] (03CR) 10Ottomata: [C: 032] Remove eventlogging logs older than 90 days [puppet] - 10https://gerrit.wikimedia.org/r/171329 (owner: 10Ottomata) [20:30:15] (03PS2) 10Dzahn: fix up ordering for salt-minion package, config, service [puppet] - 10https://gerrit.wikimedia.org/r/162860 (owner: 10ArielGlenn) [20:30:50] (03CR) 10Dzahn: fix up ordering for salt-minion package, config, service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/162860 (owner: 10ArielGlenn) [20:33:37] (03CR) 10Ottomata: [C: 031] access: grant reedy access to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/170035 (owner: 10Matanya) [20:34:26] !log reedy Finished scap: testwiki to 1.25wmf7, build l10n cache (duration: 45m 03s) [20:34:35] that was... slow [20:34:35] :/ [20:34:35] Logged the message, Master [20:34:43] (03PS2) 10Dzahn: access: grant reedy access to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/170035 (owner: 10Matanya) [20:35:49] (03CR) 10Reedy: [C: 032] Wikipedias to 1.25wmf6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171317 (owner: 10Reedy) [20:35:57] Reedy: 4.x business days, of which 3 are waiting :p [20:35:58] (03Merged) 10jenkins-bot: Wikipedias to 1.25wmf6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171317 (owner: 10Reedy) [20:36:52] mutante: I was meaning scap [20:36:53] Haha [20:36:54] (03PS4) 10Faidon Liambotis: varnish: allow POST for EventLogging on bits [puppet] - 10https://gerrit.wikimedia.org/r/170883 [20:37:01] (03CR) 10Faidon Liambotis: [C: 032] varnish: allow POST for EventLogging on bits [puppet] - 10https://gerrit.wikimedia.org/r/170883 (owner: 10Faidon Liambotis) [20:37:17] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.25wmf6 [20:37:21] Reedy: lol, ok :) [20:37:30] Logged the message, Master [20:37:47] (03CR) 10Reedy: [C: 032] group0 to 1.25wmf7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171318 (owner: 10Reedy) [20:37:51] (03CR) 10Dzahn: [C: 032] "has approval from greg. waiting period over as well." [puppet] - 10https://gerrit.wikimedia.org/r/170035 (owner: 10Matanya) [20:37:55] (03Merged) 10jenkins-bot: group0 to 1.25wmf7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171318 (owner: 10Reedy) [20:38:19] Reedy: welcome to being analytics admin :0 [20:39:06] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.25wmf7 [20:39:14] Logged the message, Master [20:40:05] (03PS2) 10Reedy: enwiki: Add Draft: namespace to wgContentNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171024 (owner: 10Jforrester) [20:40:09] (03CR) 10Reedy: [C: 032] enwiki: Add Draft: namespace to wgContentNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171024 (owner: 10Jforrester) [20:40:17] Whee. 
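The change just approved adds enwiki's Draft namespace (ID 118) to $wgContentNamespaces. An illustrative sketch of the shape of that setting, not the literal diff: per-wiki overrides in InitialiseSettings.php are keyed by database name, so the relevant entry in the big settings array looks roughly like

    'wgContentNamespaces' => array(
        'default' => array( NS_MAIN ),
        // ... other per-wiki overrides ...
        'enwiki'  => array( NS_MAIN, 118 ),  // 118 = Draft
    ),

while on a single wiki's LocalSettings.php the equivalent would simply be $wgContentNamespaces[] = 118;. As comes up later in the log, on a CirrusSearch wiki pages in the affected namespace drop out of search results until a full reindex completes.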
[20:40:19] (03Merged) 10jenkins-bot: enwiki: Add Draft: namespace to wgContentNamespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171024 (owner: 10Jforrester) [20:40:31] James_F: Should I do https://gerrit.wikimedia.org/r/#/c/157478/ today? [20:40:40] It's "scheduled" for tomorrow [20:40:46] But I'm guessing it was supposed to ride the train [20:40:49] Reedy: No, wait for tomorrow. [20:41:02] Reedy: We've announced it for tomorrow, less confusing for it to go out as announced. [20:41:46] (03PS5) 10Faidon Liambotis: varnish: allow POST for EventLogging on bits [puppet] - 10https://gerrit.wikimedia.org/r/170883 [20:41:49] <^d> mutante, manybubbles: Yeah that's me. [20:41:54] * ^d kicks elastic a bit [20:41:59] <^d> Shut up elasticsearch [20:42:14] (03CR) 10Reedy: [C: 04-1] "Won't work as expected" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171273 (owner: 10KartikMistry) [20:43:11] (03PS2) 10Reedy: Remove $wgCentralAuthSilentLogin and $wgCentralAuthUseOldAutoLogin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164013 (owner: 10PleaseStand) [20:43:17] (03CR) 10Reedy: [C: 032] Remove $wgCentralAuthSilentLogin and $wgCentralAuthUseOldAutoLogin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164013 (owner: 10PleaseStand) [20:43:25] (03Merged) 10jenkins-bot: Remove $wgCentralAuthSilentLogin and $wgCentralAuthUseOldAutoLogin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/164013 (owner: 10PleaseStand) [20:44:58] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 17s) [20:45:06] Logged the message, Master [20:45:22] James_F: if you are late ofen, being early once is nice isn't it? :P [20:45:53] matanya: I'm not sure the 800 wikis who'd get TemplateData GUI a day early would see it that way. :-) [20:46:20] :) [20:48:32] (03PS2) 10Reedy: Initial configuration for Maithili Wikipedia (maiwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169758 (https://bugzilla.wikimedia.org/72346) (owner: 10Glaisher) [20:50:45] (03CR) 10Reedy: [C: 032] Initial configuration for Maithili Wikipedia (maiwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169758 (https://bugzilla.wikimedia.org/72346) (owner: 10Glaisher) [20:50:53] (03Merged) 10jenkins-bot: Initial configuration for Maithili Wikipedia (maiwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/169758 (https://bugzilla.wikimedia.org/72346) (owner: 10Glaisher) [20:55:14] !log reedy Synchronized database lists: maiwiki (duration: 00m 18s) [20:55:21] Logged the message, Master [20:56:23] !log reedy Synchronized wmf-config/InitialiseSettings.php: maiwiki (duration: 00m 15s) [20:56:32] Logged the message, Master [20:56:56] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: maiwiki [20:57:05] Logged the message, Master [20:59:35] !log reedy Synchronized database lists: maiwiki (duration: 00m 14s) [20:59:44] Logged the message, Master [21:00:05] gwicke, cscott, arlolra, subbu: Respected human, time to deploy Parsoid/OCG (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141105T2100). Please do the needful. 
[21:00:07] !log reedy Synchronized wmf-config/InitialiseSettings.php: maiwiki (duration: 00m 14s) [21:00:12] Logged the message, Master [21:00:27] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: maiwiki [21:02:22] !log reedy Synchronized wmf-config/interwiki.cdb: Updating interwiki cache (duration: 00m 15s) [21:02:30] Logged the message, Master [21:02:55] (03PS1) 10Reedy: Update interwiki cache for maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171436 [21:03:05] (03CR) 10Reedy: [C: 032] Update interwiki cache for maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171436 (owner: 10Reedy) [21:03:13] (03Merged) 10jenkins-bot: Update interwiki cache for maiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171436 (owner: 10Reedy) [21:03:50] (03CR) 10Reedy: "wmf-config/interwiki.cdb | Bin 820289 -> 820294 bytes" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171436 (owner: 10Reedy) [21:05:12] !log Running foreachwikiindblist wikidataclient.dblist extensions/Wikidata/extensions/Wikibase/lib/maintenance/populateSitesTable.php --strip-protocols [21:05:23] Logged the message, Master [21:06:03] Who deals with Google Webmaster Tools? [21:07:08] People! [21:07:42] Krenair: nobody [21:07:56] Who needs it? [21:08:13] "mai" means never. FYI [21:09:21] Crazy https://gerrit.wikimedia.org/r/171024 [21:10:11] <^d> SHIT @ enwiki content namespace change. [21:10:25] * ^d goes to add a big big glaring warning about changing that [21:10:33] Nemo_bis, https://www.google.co.uk/search?q=nerve+cell [21:10:58] <^d> Actually, it has a warning! [21:11:04] <^d> But obvs. nobody saw it :( [21:11:09] <^d> // Note that changing this for wikis with CirrusSearch will remove pages in the [21:11:10] <^d> // the affected namespace from search results until a full reindex is completed. [21:11:20] <^d> James_F: You made Drafts disappear from searches! [21:11:21] Krenair: Philippe, Stuart West, f.e. [21:11:43] ^d: maybe that was the aim? :p [21:11:45] "en.wikipedia.org/wiki/Neuron [21:11:46] This site may be hacked." [21:11:51] Krenair: is it? :) [21:12:02] <^d> Krenair: I'm telling you, being able to edit pages is *not a hack* [21:12:05] <^d> Sheeesh [21:12:15] Someone inform Google! [21:12:25] tell the SEO people that, so they stop asking us to edit for them [21:13:10] mutante: the SEO people probably can't edit, actually [21:13:24] Unless they have big budgets for legal fees, that is [21:13:26] !log deployed parsoid version 978623eb [21:13:34] Logged the message, Master [21:13:58] Krenair: IIRC there is no google wemaster set up for wikipedia.org? Only for wikivoyage. For sure last time I asked nobody knew who could have access :) [21:14:14] there are RT tickets in the past asking for it [21:14:33] Krenair: so the best chance is asking Eloquence or TimStarling to warn Google contacts [21:14:37] but wikivoyage, yea [21:15:05] Anyone can have access [21:15:06] ask Stuart [21:15:10] board member [21:15:18] Just need to get their verification stuff merged into dns ;) [21:15:27] (03CR) 10Chad: "I see no bug linked for this. Was this discussed anywhere? Also, the warning at the top of $wgContentNamespaces wasn't heeded, as we're no" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171024 (owner: 10Jforrester) [21:15:56] <^d> I used to have access to it a long long time ago. [21:16:05] <^d> I remember when we used it to set crud up for wikinews. 
[21:16:14] lol [21:16:20] I need to do that for GNSM [21:16:46] wikipedia.org is such a boring domain [21:16:54] I needed access for MediaWiki.org [21:17:02] <^d> wikis are boring. [21:18:06] Not all of them [21:18:27] redeploying parsoid in just a minute -- bad warning level config deployed. breaks ve edits. [21:19:22] James_F, RoanKattouw fyi ^ [21:19:33] waiting on zuul to merge. [21:20:09] <^d> Nemo_bis: Yes, all of them. [21:20:10] <^d> Ever. [21:20:13] <^d> Every. [21:20:14] <^d> Last. [21:20:15] <^d> One. [21:20:16] <^d> So boring. [21:20:23] <^d> We should use google docs instead. [21:20:24] <^d> Way cooler. [21:20:39] Does Google Docs have 26 GB of buses? http://hkbus.wikia.com [21:20:56] <^d> Does anyone need 26GB of buses? [21:21:00] Hmpf, my next example shut down http://www.wikinsk.ru/ [21:21:39] (03PS1) 10Ori.livneh: Small fix to gdbinit [puppet] - 10https://gerrit.wikimedia.org/r/171439 [21:24:14] !log redployed parsoid deploy sha 66befe47 (with the right bunyan log level that unbreaks VE) [21:24:22] Logged the message, Master [21:32:47] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2062: active_shards: 6201: relocating_shards: 1: initializing_shards: 2: unassigned_shards: 1 [21:32:47] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2062: active_shards: 6201: relocating_shards: 1: initializing_shards: 2: unassigned_shards: 1 [21:32:47] PROBLEM - ElasticSearch health check on elastic1031 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2062: active_shards: 6201: relocating_shards: 1: initializing_shards: 2: unassigned_shards: 1 [21:32:47] PROBLEM - ElasticSearch health check on elastic1024 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2062: active_shards: 6201: relocating_shards: 1: initializing_shards: 2: unassigned_shards: 1 [21:32:47] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2062: active_shards: 6201: relocating_shards: 1: initializing_shards: 2: unassigned_shards: 1 [21:33:18] <^d> yeah yeah, shut up icinga-wm [21:34:16] PROBLEM - ElasticSearch health check on elastic1029 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2062: active_shards: 6201: relocating_shards: 1: initializing_shards: 2: unassigned_shards: 1 [21:35:49] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2063: active_shards: 6205: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [21:35:49] RECOVERY - ElasticSearch health check on elastic1024 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2063: active_shards: 6205: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [21:35:49] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2063: active_shards: 6205: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [21:35:49] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2063: active_shards: 6205: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [21:35:49] RECOVERY - ElasticSearch health check on elastic1031 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2063: active_shards: 6205: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [21:36:07] PROBLEM - ElasticSearch health check on elastic1020 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2062: active_shards: 6202: relocating_shards: 6: initializing_shards: 1: unassigned_shards: 1 [21:36:07] PROBLEM - ElasticSearch health check on elastic1005 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2062: active_shards: 6202: relocating_shards: 6: initializing_shards: 1: unassigned_shards: 1 [21:36:07] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2062: active_shards: 6202: relocating_shards: 6: initializing_shards: 1: unassigned_shards: 1 [21:36:07] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2062: active_shards: 6202: relocating_shards: 6: initializing_shards: 1: unassigned_shards: 1 [21:36:07] PROBLEM - ElasticSearch health check on elastic1023 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2062: active_shards: 6202: relocating_shards: 6: initializing_shards: 1: unassigned_shards: 1 [21:37:09] RECOVERY - ElasticSearch health check on elastic1023 is OK: OK - elasticsearch (production-search-eqiad) is running. 
status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2063: active_shards: 6205: relocating_shards: 5: initializing_shards: 0: unassigned_shards: 0 [21:37:16] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2063: active_shards: 6205: relocating_shards: 5: initializing_shards: 0: unassigned_shards: 0 [21:37:16] RECOVERY - ElasticSearch health check on elastic1029 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2063: active_shards: 6205: relocating_shards: 5: initializing_shards: 0: unassigned_shards: 0 [21:38:07] RECOVERY - ElasticSearch health check on elastic1020 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2062: active_shards: 6202: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [21:38:10] RECOVERY - ElasticSearch health check on elastic1005 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2062: active_shards: 6202: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [21:38:10] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2062: active_shards: 6202: relocating_shards: 3: initializing_shards: 0: unassigned_shards: 0 [21:48:37] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2061: active_shards: 6197: relocating_shards: 2: initializing_shards: 3: unassigned_shards: 1 [21:49:39] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2062: active_shards: 6202: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0 [22:00:04] yurik: Respected human, time to deploy Wikipedia Zero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141105T2200). Please do the needful. [22:02:26] (03PS3) 10GWicke: WIP: RESTBase puppet module [puppet] - 10https://gerrit.wikimedia.org/r/167213 [22:03:06] (03CR) 10jenkins-bot: [V: 04-1] WIP: RESTBase puppet module [puppet] - 10https://gerrit.wikimedia.org/r/167213 (owner: 10GWicke) [22:12:14] (03CR) 10Chad: "This might not be a problem, I was mainly just irked about having to rebuild enwiki for a second time this week." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/171024 (owner: 10Jforrester) [22:12:39] Hi... I'm trying to figure out how cache expire times are controlled on production... I found varnish vcl stuff in puppet, and mod_expires stuff in apache-config, but I don't get how they interact and how cache expiry times for different types of request are controlled... pointers anyone? [22:12:44] thanks in advance... :) [22:13:07] (03CR) 10Jforrester: "Sorry for not highlighting it to the team." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/171024 (owner: 10Jforrester) [22:16:20] AndyRussG: Expiry times for what exactly? [22:16:29] Static assets? ResourceLoader requests? MediaWiki requests? [22:17:25] Hi RoanKattouw :) ResourceLoader, specifically what we were talking about yesterday, getting custom config info to an extension's resourceloader module [22:17:34] thanks in advance :) [22:18:01] Oh that's controlled from PHP [22:18:06] Ah hmmmmmmmm [22:18:14] ResourceLoader.php outputs Cache-Control headers [22:18:14] Yeah since this is a not-insignificant change in how CentralNotice sends people banners, and the year-end fundraiser is coming up... [22:18:48] I just wanted to try to understand myself the infrastructure and config file involved, and put more details in the design docs... [22:18:49] There is then some standard across-the-board mangling that Varnish does where s-maxage is overwritten to 0 [22:19:14] So Varnish respects the original s-maxage, but it tells any other proxies that may be between us and the user to not cache [22:19:44] AndyRussG: See sendResponseHeaders() in ResourceLoader.php [22:19:54] And feel free to ping me if you have questions, I'm pretty sure I wrote that function [22:20:56] heheh OK, fantastic thanks [22:21:14] I got as far as realizing that it wasn't really controlled by the Varnish config [22:21:25] and started hunting in the apache configs... [22:30:48] (03PS2) 10Andrew Bogott: Install libvips, libvips-dev and libvips-tools on Trusty. [puppet] - 10https://gerrit.wikimedia.org/r/171319 [22:30:51] RoanKattouw: K I'll check out that function and send any more questions your way... :) I guess the only other question I have is, just to check that I'm understanding correctly: the ResourceLoaderGetConfigVars is actually called on a different request from the one most PHP code is running in, no? [22:31:06] Yes [22:31:31] So, as I was saying earlier, we distinguish between "config data" and "page properties" [22:31:35] "config data" is largely request-independent [22:31:55] You have a few things in ResourceLoaderContext like the user's language but not much [22:32:08] "page properties", as the name suggests, vary between pages [22:32:49] So an example of config would be "which namespaces on this wiki allow subpages" or "what are the decimal separation rules for the user's language" whereas examples of page properties would be "what is the current revision ID of this page" [22:33:26] ResourceLoaderGetConfigVars gets config stuff, and MakeGlobalVariablesScript (terrible hook name, was kept for backwards compatibility) gets page property stuff [22:34:11] <^d> Not just a terrible hook name, a terrible hook! [22:34:17] RLGCV is executed on the request for modules=startup [22:34:43] MGVS is executed on normal page requests [22:35:11] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: Puppet has 2 failures [22:35:24] RoanKattouw: cool that's what I imagined..... [22:36:31] thx again :) [22:38:34] (03CR) 10Yuvipanda: [C: 04-1] "This also causes the rebase conflict." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/170484 (owner: 10John F. 
[22:40:16] PROBLEM - check_puppetrun on db1025 is CRITICAL: CRITICAL: Puppet has 2 failures [22:41:30] ^d: hooks are neither good nor evil, they just are [22:41:41] <^d> Hooks are evilllll [22:42:50] only the ones on the feet of giant arctic mosquitos [22:45:08] RECOVERY - check_puppetrun on db1025 is OK: OK: Puppet is currently enabled, last run 220 seconds ago with 0 failures [22:45:41] (03PS3) 10Yuvipanda: Install libvips, libvips-dev and libvips-tools on Trusty. [puppet] - 10https://gerrit.wikimedia.org/r/171319 (owner: 10Andrew Bogott) [22:47:44] (03CR) 10Yuvipanda: [C: 032] Install libvips, libvips-dev and libvips-tools on Trusty. [puppet] - 10https://gerrit.wikimedia.org/r/171319 (owner: 10Andrew Bogott) [22:51:34] YuviPanda: is that applying OK on actual tools nodes? [22:51:42] andrewbogott: I'm running now [22:52:36] andrewbogott: is ok on precise, running on trusty [22:52:57] I tried to test on a test instance but totally failed to get a test instance to puppetize properly. I'm not sure what that was about... [22:53:01] applying the wrong classes I guess [22:53:25] That or our puppet is currently badly broken and can't set up new nodes :( [22:54:14] andrewbogott: seems ok on the trusty host too, so I'd consider it done :) [22:54:22] cool [22:54:59] Lemme see if I can reproduce the problem I was having before... [22:55:18] (03PS2) 10Ori.livneh: Small fix to gdbinit [puppet] - 10https://gerrit.wikimedia.org/r/171439 [22:55:25] (03CR) 10Ori.livneh: [C: 032 V: 032] Small fix to gdbinit [puppet] - 10https://gerrit.wikimedia.org/r/171439 (owner: 10Ori.livneh) [22:55:41] (03CR) 10Yuvipanda: "Seems ok on both tools-login (precise) and tools-trusty (trusty)" [puppet] - 10https://gerrit.wikimedia.org/r/171319 (owner: 10Andrew Bogott) [22:56:35] (03PS1) 10Milimetric: [WIP] Add cron job that generates flow statistics [puppet] - 10https://gerrit.wikimedia.org/r/171465 [23:05:27] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [23:14:47] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [23:14:48] (03PS2) 10Dzahn: beta: linting and autoload modules [puppet] - 10https://gerrit.wikimedia.org/r/170484 (owner: 10John F. Lewis) [23:17:08] (03CR) 10Dzahn: beta: linting and autoload modules (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/170484 (owner: 10John F. Lewis) [23:17:10] w/in 5 [23:20:38] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [23:21:06] (03CR) 10Spage: "Ship it! Only a comment about the rsync command."
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/171465 (owner: 10Milimetric) [23:25:27] PROBLEM - puppet last run on mw1023 is CRITICAL: CRITICAL: Puppet last ran 17 hours ago [23:30:28] RECOVERY - puppet last run on mw1023 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [23:48:58] AaronSchulz: fyi, RedisJobRunnerService is spamming fatal.log with [2014-11-05 23:45:20] Fatal error: Allowed memory size of 268435456 bytes exhausted (tried to allocate 262144 bytes) at /srv/deployment/jobrunner/jobrunner/redisJobRunnerService on line 666 [23:49:02] AaronSchulz: every 10 sec or so [23:49:07] PROBLEM - puppet last run on cp4019 is CRITICAL: CRITICAL: puppet fail [23:51:44] (03PS1) 10Dzahn: add missing mobile DNS entries [dns] - 10https://gerrit.wikimedia.org/r/171475 [23:54:08] PROBLEM - Disk space on ocg1001 is CRITICAL: DISK CRITICAL - free space: / 350 MB (3% inode=72%): [23:55:46] (03CR) 10Hoo man: [C: 04-1] "I don't think wikitech has MobileFrontend enabled." [dns] - 10https://gerrit.wikimedia.org/r/171475 (owner: 10Dzahn) [23:57:43] (03CR) 10Reedy: [C: 04-1] add missing mobile DNS entries (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/171475 (owner: 10Dzahn) [23:58:05] (03Abandoned) 10Reedy: Add mobile subdomains for Wikimedia chapter wikis [dns] - 10https://gerrit.wikimedia.org/r/156596 (owner: 10Reedy) [23:58:23] (03PS2) 10Dzahn: add missing mobile DNS entries [dns] - 10https://gerrit.wikimedia.org/r/171475 [23:59:04] (03CR) 10Dzahn: "removed wikitech, removed uk. (and did not add it.) because they redirect elsewhere." [dns] - 10https://gerrit.wikimedia.org/r/171475 (owner: 10Dzahn) [23:59:21] did not add ".it", as in Italy [23:59:58] I am looking at the ocg problem.
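On the RedisJobRunnerService fatal reported above: 268435456 bytes is exactly 256 MiB, so the runner appears to be hitting a 256M PHP memory_limit (an inference from the log line, not something stated in the channel). A tiny sketch of the arithmetic:

    <?php
    // 268435456 bytes from the fatal error is exactly 256 MiB:
    // 256 * 1024 * 1024 = 268435456. The "tried to allocate 262144 bytes"
    // part is the 256 KiB allocation that pushed the process over the limit.
    $limitBytes = 256 * 1024 * 1024;        // presumed memory_limit of 256M
    var_dump( $limitBytes === 268435456 );  // bool(true)

    // Shows the configured limit of whatever PHP process runs this snippet:
    echo ini_get( 'memory_limit' ), "\n";   // e.g. "256M"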