[00:00:02] !log ori synchronized wmf-config/squid.php 'Id188979c1: Use whole subnets in squid.php list for XFF acceptance'
[00:00:07] Logged the message, Master
[00:00:23] (PS1) Dzahn: quoted Booleans in rsync::server::module [operations/puppet] - https://gerrit.wikimedia.org/r/133645
[00:03:00] (PS2) Dzahn: quoted Booleans in rsync::server::module [operations/puppet] - https://gerrit.wikimedia.org/r/133645
[00:03:32] (CR) Dzahn: "actually this says $read_only - yes||no, defaults to yes, so "true" just happens to work" [operations/puppet] - https://gerrit.wikimedia.org/r/133645 (owner: Dzahn)
[00:04:19] (CR) MaxSem: [C: 2] Use ContentNamespace rather than NearbyNamespace [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/131762 (owner: Jdlrobson)
[00:04:28] (Merged) jenkins-bot: Use ContentNamespace rather than NearbyNamespace [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/131762 (owner: Jdlrobson)
[00:05:18] (CR) BryanDavis: [C: 1] "Cool. This file was a mess before." [operations/puppet] - https://gerrit.wikimedia.org/r/133644 (owner: Dzahn)
[00:06:06] (CR) Dzahn: "matanya will say to split all file resources into separate ones:)" [operations/puppet] - https://gerrit.wikimedia.org/r/133644 (owner: Dzahn)
[00:07:17] ori: thanks for pushing all that through. it was a long road to run down that subproblem of a subproblem of a subproblem, now I can get back on track with other dependent-ish things :)
[00:07:51] !log maxsem synchronized wmf-config 'https://gerrit.wikimedia.org/r/#/c/131762/'
[00:07:56] Logged the message, Master
[00:08:11] (PS1) Ori.livneh: role::analytics::kafka: do not call keys() on non-hash [operations/puppet] - https://gerrit.wikimedia.org/r/133646
[00:08:20] bblack: np! thanks for the patch!
[00:09:03] (CR) Ori.livneh: [C: 2 V: 2] "ottomata, merging to unbreak beta."
[operations/puppet] - https://gerrit.wikimedia.org/r/133646 (owner: Ori.livneh)
[00:09:25] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2064: active_shards: 6191: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 0
[00:09:25] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2064: active_shards: 6191: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 0
[00:09:25] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2064: active_shards: 6191: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 0
[00:09:25] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2064: active_shards: 6191: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 0
[00:09:25] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2064: active_shards: 6191: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 0
[00:09:26] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running.
status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2064: active_shards: 6191: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 0
[00:09:33] welp
[00:09:41] manybubbles: ?
[00:09:50] 1 shard relocating
[00:09:54] I look
[00:10:35] Its the monitoring that is broken....
[00:10:43] I hate this monitoring!
[00:11:03] you're not anomalous in that respect
[00:11:17] I had something to fix it.... let me look at it
[00:11:26] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2065: active_shards: 6192: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0
[00:11:26] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2065: active_shards: 6192: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0
[00:11:26] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2065: active_shards: 6192: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0
[00:11:26] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2065: active_shards: 6192: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0
[00:11:26] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running.
status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2065: active_shards: 6192: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0
[00:11:26] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2065: active_shards: 6192: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0
[00:14:47] (PS1) Dzahn: fix "read_only" setting in rsyncd setups [operations/puppet] - https://gerrit.wikimedia.org/r/133647
[00:17:18] (PS3) Dzahn: quoted Booleans in rsync::server::module [operations/puppet] - https://gerrit.wikimedia.org/r/133645
[00:20:35] PROBLEM - Puppet freshness on maerlant is CRITICAL: Last successful Puppet run was Tue May 13 21:03:26 2014
[00:21:58] (PS3) Dzahn: add rsyncd config for apache config sync [operations/puppet] - https://gerrit.wikimedia.org/r/133633
[00:22:36] (CR) Dzahn: "like misc::nfs-server::home::rsyncd" [operations/puppet] - https://gerrit.wikimedia.org/r/133633 (owner: Dzahn)
[00:23:32] (PS1) Ori.livneh: webperf/navtiming.py: update for latest schema rev [operations/puppet] - https://gerrit.wikimedia.org/r/133648
[00:23:43] (CR) Ori.livneh: [C: 2 V: 2] webperf/navtiming.py: update for latest schema rev [operations/puppet] - https://gerrit.wikimedia.org/r/133648 (owner: Ori.livneh)
[00:25:49] (PS1) Dzahn: sync apache:incl.rsync server & network constants [operations/puppet] - https://gerrit.wikimedia.org/r/133649
[00:27:12] (PS2) Dzahn: sync apache:incl.rsync server & network constants [operations/puppet] - https://gerrit.wikimedia.org/r/133649
[00:28:24] (CR) BryanDavis: "I updated the beta puppet master and forced a puppet run on deployment-eventlogging02 to apply this change.
The beta project uses a self-h" [operations/puppet] - https://gerrit.wikimedia.org/r/133646 (owner: Ori.livneh)
[00:28:28] (PS4) Dzahn: add rsyncd config for apache config sync [operations/puppet] - https://gerrit.wikimedia.org/r/133633
[00:29:15] " beta project uses a self-hosted puppet master that doesn't automatically sync with the production branch"
[00:29:29] i understand why, but it's also kind of against the purpose of beta a bit ?
[00:30:55] (CR) Dzahn: [C: 2] add rsyncd config for apache config sync [operations/puppet] - https://gerrit.wikimedia.org/r/133633 (owner: Dzahn)
[00:32:49] !log manually ran rebuildEntityPerPage for Wikidata to fix 2 broken records
[00:32:54] Logged the message, Master
[00:33:22] If someone could have a look at https://gerrit.wikimedia.org/r/120535 I might not have to run that one manually every now and then...
[00:34:50] mutante: Well we have a double sided problem; hashar and I can't merge into operations/puppet.git
[00:35:21] So having a local puppet master helps keep things moving
[00:35:28] hoo: just says that "not enabled yet" but then it enables it
[00:35:42] But it also means that we have to manually update.
[00:36:24] mutante: mh? It is supposed to enable it
[00:36:33] I've "been meaning to" write a script and cron it to sync with the production branch every 30 minutes or so, but it hasn't happened yet
[00:37:01] (PS3) Ori.livneh: Move diamond::generic to manifests/ and lint [operations/puppet] - https://gerrit.wikimedia.org/r/132218
[00:37:06] bd808: i understand.. yea
[00:37:32] Solution: ops takes over managing the beta environment :)
[00:37:48] hoo: what about " # not enabled yet until wikidata gets switched to new build of Wikibase "
[00:38:16] where the hell?
[00:38:24] That has been done months ago
[00:38:28] * hoo rebases
[00:38:29] hoo: line 237
[00:38:46] bd808: or you get +2 or we merge everything in beta first..
?:P
[00:39:05] mutante: That line lies to you :P Will rebase and remove that
[00:39:35] mutante bd808 merging things to beta first would be pretty great, actually
[00:39:36] bd808: sounds like something for scrum of scrum :)
[00:39:48] inter-team and workflow related
[00:40:35] (PS4) Ori.livneh: diamond: use native Puppet types for collector vars [operations/puppet] - https://gerrit.wikimedia.org/r/132218
[00:41:09] (PS6) Hoo man: Run rebuildEntityPerPage.php on Wikidata (once per week) [operations/puppet] - https://gerrit.wikimedia.org/r/120535
[00:41:29] (CR) Hoo man: "Rebased, removed an outdated comment" [operations/puppet] - https://gerrit.wikimedia.org/r/120535 (owner: Hoo man)
[00:41:50] damn
[00:41:52] (PS3) Dzahn: sync apache:incl.rsync server & network constants [operations/puppet] - https://gerrit.wikimedia.org/r/133649
[00:41:56] why is it back to mwdeploy now
[00:42:18] * hoo slaps git rebase
[00:42:43] (PS7) Hoo man: Run rebuildEntityPerPage.php on Wikidata (once per week) [operations/puppet] - https://gerrit.wikimedia.org/r/120535
[00:43:18] mutante: --^ should be good now
[00:43:29] (PS5) Ori.livneh: diamond: use native Puppet types for collector vars [operations/puppet] - https://gerrit.wikimedia.org/r/132218
[00:44:20] (CR) Dzahn: [C: 2] "both are already included in misc::deployment::scap_primary so no change on tin, but making the role complete" [operations/puppet] - https://gerrit.wikimedia.org/r/133649 (owner: Dzahn)
[00:45:38] (CR) Ori.livneh: [C: 2] diamond: use native Puppet types for collector vars [operations/puppet] - https://gerrit.wikimedia.org/r/132218 (owner: Ori.livneh)
[00:48:41] hoo: running it on terbium
[00:49:04] mutante: heh...
that's what I did a few minutes ago, but go ahead
[00:50:25] (CR) Dzahn: [C: 2] "it took about 40 seconds to run manually on terbium" [operations/puppet] - https://gerrit.wikimedia.org/r/120535 (owner: Hoo man)
[00:52:28] (PS2) Dzahn: dns recursors: add ferm rules [operations/puppet] - https://gerrit.wikimedia.org/r/133513 (owner: Matanya)
[00:52:42] hoo: Misc::Maintenance::Wikidata/Cron[wikibase-rebuild-entityperpage]/ensure: created
[00:54:14] Great :)
[00:54:54] !log rebuildItemsPerSite finished running for Wikidata (after about 30h).
[00:54:59] Logged the message, Master
[00:55:14] That's another script
[00:55:29] that runs 30hours instead of seconds ?:O
[00:55:45] :o
[00:55:51] mutante: It loads all items (fully from JSON) and then compares them to the secondary storage...
[00:56:00] rebuild entities can query against page table
[00:56:04] to find missing
[00:56:13] not possible for items
[00:56:21] Yep
[00:56:21] wow, i see
[00:56:29] (PS1) Ori.livneh: webperf/navtiming: tolerate any schema rev. ID [operations/puppet] - https://gerrit.wikimedia.org/r/133654
[00:57:13] aude: we fixed about 25k-30k items
[00:57:13] wow
[00:57:13] and found 7074 with conflicts
[00:57:25] (CR) Ori.livneh: [C: 2 V: 2] webperf/navtiming: tolerate any schema rev. ID [operations/puppet] - https://gerrit.wikimedia.org/r/133654 (owner: Ori.livneh)
[00:57:41] will make a list for the community tomorrow...
I guess someone can fix them by bot or so
[00:57:45] ok
[00:57:46] what is the difference between those appservers with weight "12" and those with weight "10" in pybal
[00:57:58] the former go to 11
[01:02:17] !log ori synchronized php-1.24wmf5/includes/parser 'I12a60b5cc: Revert "Declare visibility on class properties of includes/parser/"'
[01:02:20] Logged the message, Master
[01:02:24] ^ aude, MatmaRex
[01:02:26] looks better
[01:02:53] so does the syntax highlighting
[01:04:16] ori: i wonder, is it possible to clear the recent parser cache entries which might have corrupted data now?
[01:04:35] PROBLEM - Puppet freshness on fenari is CRITICAL: Last successful Puppet run was Thu May 15 22:04:22 2014
[01:04:50] on feanari.. interesting
[01:05:02] although the [edit] link (some of them) on test wikidata look odd now
[01:05:04] yes, just hop into the reactor and grab the glowing rod -- if you do it quickly enough it should be safe
[01:05:15] MatmaRex: in other words: possible, yes, but i'm not going to try
[01:05:25] heh
[01:05:33] Duplicate definition: Package[libapache2-mod-php5] is already defined in file /etc/puppet/modules/applicationserver/manifests/packages.pp at line 6
[01:05:38] that new?
[01:05:42] that's my change, hang on
[01:05:43] clearing a few hours' worth *should* not bring down the wikis
[01:05:49] alright
[01:06:02] MatmaRex: memcached doesn't have a select * from .. where date ...
[01:06:11] (PS1) Jgreen: scope $Rebuild in otrs.TicketExport2Mbox.pl [operations/puppet] - https://gerrit.wikimedia.org/r/133656
[01:08:29] (PS1) Ori.livneh: Remove libapache2-mod-php5 from manifests::misc::noc [operations/puppet] - https://gerrit.wikimedia.org/r/133657
[01:08:36] mutante: ^
[01:08:45] ori: okay
[01:09:31] (CR) Dzahn: [C: 1] "yea, good enough because we are going to get rid of fenari" [operations/puppet] - https://gerrit.wikimedia.org/r/133657 (owner: Ori.livneh)
[01:09:44] (CR) Jgreen: [C: 2 V: 1] scope $Rebuild in otrs.TicketExport2Mbox.pl [operations/puppet] - https://gerrit.wikimedia.org/r/133656 (owner: Jgreen)
[01:10:38] (CR) Ori.livneh: [C: 2] Remove libapache2-mod-php5 from manifests::misc::noc [operations/puppet] - https://gerrit.wikimedia.org/r/133657 (owner: Ori.livneh)
[01:10:40] ori: cool, yea, no need to find elegant solutions for fenari
[01:12:33] (PS1) Dzahn: adjust rsync path for httpd configs [operations/puppet] - https://gerrit.wikimedia.org/r/133658
[01:14:25] RECOVERY - Puppet freshness on fenari is OK: puppet ran at Fri May 16 01:14:18 UTC 2014
[01:15:39] (PS5) Dzahn: fix sync-apache for use in eqiad [operations/puppet] - https://gerrit.wikimedia.org/r/130610
[01:16:09] (CR) Dzahn: [C: 2] adjust rsync path for httpd configs [operations/puppet] - https://gerrit.wikimedia.org/r/133658 (owner: Dzahn)
[01:19:50] ori: re: all my recent changes, so deploying apache from tin should theoretically be possible now, but i don't want to change the docs before i have actually done a deployment (soon!)
[01:20:09] (and fenari is untouched so it cant break stuff)
[01:21:10] ori man I'm not happy you made this change https://gerrit.wikimedia.org/r/#/c/132218/
[01:21:17] without asking me or putting it up for review
[01:21:49] if you had looked at the previous changeset you would have seen daniel and I had that same conversation and together, through appropriate process decided to go with what we had
[01:22:17] and you self merging changes on something I'm very clearly working on during my off hours without any discussion is quite inconsiderate
[01:22:54] and it's even worse that it was a nonfunctional change as their is non reason not to wait
[01:22:58] and to ask me
[01:23:58] chasemp: did you see what the patch consists of?
[01:24:13] your argument that true in this case should not be quoted is not a good one as that is _not_ a puppet boolean
[01:24:31] it is a coincidence that the 'true' literal used in the configuration file shadows true in puppet
[01:24:35] sorry, i was sure that that bit was completely uncontroversial
[01:24:53] i'd be perfectly okay with reverting
[01:25:23] I appreciate that, and I can understand why you thought it was a noop
[01:25:32] yes, sorry about that
[01:25:37] not my intention to make you feel bypassed at all
[01:25:48] but I do feel like since you know I'm currently working out diamond stuff self merging changes to it is not cool man
[01:26:13] hey understood, all good, I'm going to revert it, and thank you
[01:26:19] (CR) Dzahn: [C: 2] "rsyncing from that IP works, like on test: root@mw1017:/tmp/apachetest# rsync -a 10.64.0.196::httpdconf/ /tmp/apachetest/" [operations/puppet] - https://gerrit.wikimedia.org/r/130610 (owner: Dzahn)
[01:26:31] ok, noted. i did comment in 5/11 saying i'd amend the patch to make it only be about the parameter types
[01:27:21] you didn't reply, so i had no reason not to think it wasn't cosmetic. but anyways, not arguing -- you're right. would you like me to deploy the revert?
[01:27:42] ^ chasemp
[01:27:50] sure thank you, and I didn't notice the other comments or I would have replied man
[01:28:04] (PS3) Dzahn: fix apache-fast-test for use in eqiad [operations/puppet] - https://gerrit.wikimedia.org/r/130614
[01:28:27] no problem, i'll err on the side of checking with you next time
[01:28:54] thank you, I do appreciate it and I'm not trying to be unreasonable
[01:29:24] not unreasonable at all, thanks for saying so
[01:29:43] (PS1) Ori.livneh: Revert "diamond: use native Puppet types for collector vars" [operations/puppet] - https://gerrit.wikimedia.org/r/133660
[01:30:00] chasemp: ^
[01:30:40] alright man thanks again, now I'm go back to watch chopped and eat cheesecake later on
[01:30:54] (CR) Dzahn: "merging even though we don't have /srv/pybal yet, but it's not going to be /h/w/ and not breaking anything that worked before" [operations/puppet] - https://gerrit.wikimedia.org/r/130614 (owner: Dzahn)
[01:31:04] * ori waves
[01:31:12] (PS2) Ori.livneh: Revert "diamond: use native Puppet types for collector vars" [operations/puppet] - https://gerrit.wikimedia.org/r/133660
[01:33:26] (CR) Ori.livneh: [C: 2] "merging per discussion with chasemp" [operations/puppet] - https://gerrit.wikimedia.org/r/133660 (owner: Ori.livneh)
[01:47:16] (CR) Springle: bacula: allow mysqldumps to be kept locally (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/132214 (owner: Alexandros Kosiaris)
[01:48:33] (PS1) Dzahn: include admins::roots in base [operations/puppet] - https://gerrit.wikimedia.org/r/133663
[01:49:31] (CR) jenkins-bot: [V: -1] include admins::roots in base [operations/puppet] - https://gerrit.wikimedia.org/r/133663 (owner: Dzahn)
[01:51:01] (PS2) Dzahn: include admins::roots in base [operations/puppet] - https://gerrit.wikimedia.org/r/133663
[01:55:47] (CR) Dzahn: [C: -1] "actually, just see Ic91eb4f7c6 ..
either the rsync module should change or this should be yes/no instead of true/false" [operations/puppet] - https://gerrit.wikimedia.org/r/133645 (owner: Dzahn)
[01:58:53] (CR) Springle: Backup role::mariadb::dbstore (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/132215 (owner: Alexandros Kosiaris)
[02:04:43] (CR) Springle: bacula: allow mysqldumps to be kept locally (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/132214 (owner: Alexandros Kosiaris)
[02:07:49] !log xtrabackup db1070 to db1071
[02:07:57] Logged the message, Master
[02:12:15] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3788 MB (3% inode=99%):
[02:20:15] PROBLEM - Disk space on virt0 is CRITICAL: DISK CRITICAL - free space: /a 3431 MB (3% inode=99%):
[02:22:15] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2065: active_shards: 6194: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 0
[02:22:15] PROBLEM - ElasticSearch health check on elastic1013 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2065: active_shards: 6194: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 0
[02:22:15] PROBLEM - ElasticSearch health check on elastic1008 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2065: active_shards: 6194: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 0
[02:23:15] RECOVERY - ElasticSearch health check on elastic1006 is OK: OK - elasticsearch (production-search-eqiad) is running.
status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2066: active_shards: 6197: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0
[02:23:15] RECOVERY - ElasticSearch health check on elastic1008 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2066: active_shards: 6197: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0
[02:23:15] RECOVERY - ElasticSearch health check on elastic1013 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2066: active_shards: 6197: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0
[02:30:35] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Thu May 15 23:29:35 2014
[02:39:44] !log LocalisationUpdate completed (1.24wmf4) at 2014-05-16 02:38:41+00:00
[02:39:49] Logged the message, Master
[03:00:15] RECOVERY - Disk space on virt0 is OK: DISK OK
[03:09:08] !log LocalisationUpdate completed (1.24wmf5) at 2014-05-16 03:08:04+00:00
[03:09:12] Logged the message, Master
[03:21:35] PROBLEM - Puppet freshness on maerlant is CRITICAL: Last successful Puppet run was Tue May 13 21:03:26 2014
[03:55:09] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri May 16 03:54:03 UTC 2014 (duration 54m 2s)
[03:55:14] Logged the message, Master
[04:00:10] (CR) Ori.livneh: [C: 2] Manage php symlink automatically [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/133528 (https://bugzilla.wikimedia.org/64748) (owner: Reedy)
[04:00:21] (Merged) jenkins-bot: Manage php symlink automatically [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/133528 (https://bugzilla.wikimedia.org/64748) (owner: Reedy)
[05:31:35] PROBLEM - Puppet freshness on db1007 is
CRITICAL: Last successful Puppet run was Thu May 15 23:29:35 2014
[05:46:25] PROBLEM - RAID on searchidx1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[05:47:15] RECOVERY - RAID on searchidx1001 is OK: OK: optimal, 1 logical, 4 physical
[06:00:25] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Fri May 16 06:00:18 UTC 2014
[06:03:35] RECOVERY - Puppet freshness on maerlant is OK: puppet ran at Fri May 16 06:03:33 UTC 2014
[06:20:23] !log ori synchronized php-1.24wmf4/maintenance/compareParserCache.php 'Ica69a3ef2: Added a script to compare current parser output to cache (no impact on prod; syncing for consistency)'
[06:20:28] Logged the message, Master
[06:27:35] PROBLEM - HTTP on carbon is CRITICAL: Connection refused
[06:28:42] (PS1) Springle: switch dbstore box analytics roles during upgrade [operations/dns] - https://gerrit.wikimedia.org/r/133670
[06:29:05] (CR) Springle: [C: 2] switch dbstore box analytics roles during upgrade [operations/dns] - https://gerrit.wikimedia.org/r/133670 (owner: Springle)
[06:32:35] RECOVERY - HTTP on carbon is OK: HTTP OK: HTTP/1.1 200 OK - 232 bytes in 0.001 second response time
[07:56:25] PROBLEM - ElasticSearch health check on elastic1003 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2078: active_shards: 6233: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 0
[07:56:25] PROBLEM - ElasticSearch health check on elastic1001 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2078: active_shards: 6233: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 0
[07:56:25] PROBLEM - ElasticSearch health check on elastic1014 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running.
status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2078: active_shards: 6233: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 0
[07:56:25] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2078: active_shards: 6233: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 0
[07:56:25] PROBLEM - ElasticSearch health check on elastic1010 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2078: active_shards: 6233: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 0
[07:56:26] PROBLEM - ElasticSearch health check on elastic1011 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2078: active_shards: 6233: relocating_shards: 1: initializing_shards: 1: unassigned_shards: 0
[07:57:25] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2079: active_shards: 6234: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0
[07:57:25] RECOVERY - ElasticSearch health check on elastic1014 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2079: active_shards: 6234: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0
[07:57:25] RECOVERY - ElasticSearch health check on elastic1003 is OK: OK - elasticsearch (production-search-eqiad) is running.
status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2079: active_shards: 6234: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0
[07:57:25] RECOVERY - ElasticSearch health check on elastic1010 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2079: active_shards: 6234: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0
[07:57:25] RECOVERY - ElasticSearch health check on elastic1001 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2079: active_shards: 6234: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0
[07:57:26] RECOVERY - ElasticSearch health check on elastic1011 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2079: active_shards: 6234: relocating_shards: 1: initializing_shards: 0: unassigned_shards: 0
[08:09:15] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data exceeded the critical threshold [500.0]
[08:15:48] <_joe_> that is not good
[08:21:15] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0]
[08:42:36] (CR) Alexandros Kosiaris: bacula: allow mysqldumps to be kept locally (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/132214 (owner: Alexandros Kosiaris)
[08:43:11] (PS2) Ottomata: Bumping up wikimetrics module [operations/puppet] - https://gerrit.wikimedia.org/r/133431 (owner: Nuria)
[08:43:15] (PS11) Alexandros Kosiaris: bacula: allow mysqldumps to be kept locally [operations/puppet] - https://gerrit.wikimedia.org/r/132214
[08:43:19] (CR) Ottomata: [C: 2 V: 2] Bumping up wikimetrics
module [operations/puppet] - https://gerrit.wikimedia.org/r/133431 (owner: Nuria)
[08:44:00] (CR) Alexandros Kosiaris: Backup role::mariadb::dbstore (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/132215 (owner: Alexandros Kosiaris)
[08:47:48] (CR) Alexandros Kosiaris: [C: 2] "I developed and tested this in labs as well as giving it a full catalog compilation. Having just merged Sean's last comment I anoint it re" [operations/puppet] - https://gerrit.wikimedia.org/r/132214 (owner: Alexandros Kosiaris)
[08:49:12] springle: ^
[08:49:42] I am amending the role class, doing a catalog compilation and merging/testing after that
[08:51:56] <_joe_> akosiaris: if you want, I can grant you access to the puppet compiler machine :)
[08:52:11] <_joe_> while I finish building the jenkins slave, I mean
[08:52:19] <_joe_> I should be done by ~ today
[08:52:29] :-) :-) :-)
[08:52:32] <_joe_> if IKEA does not distract me too much
[08:52:39] ahahaha
[08:53:19] <_joe_> akosiaris: they are mounting the kitchen in my new flat, they were supposed to show up at 9 AM, they arrived at 7.10 AM :|
[08:56:45] _joe_: there! they did you a favor. They woke you up before the alarm. What else do you want ?
[09:02:50] <_joe_> akosiaris: I was actually pouring coffee when they phoned me :P
[09:03:27] so no harm done :P
[09:04:35] PROBLEM - Puppet freshness on maerlant is CRITICAL: Last successful Puppet run was Fri May 16 06:03:33 2014
[09:24:38] akosiaris: cool!
[09:24:46] let's see how we go...
[09:25:55] ah springle since you are here. How do you feel about configuring the backup folder on node level? With a sane default at /srv/backups that is.
_joe_ that also goes for ya
[09:26:16] node level is fine
[09:26:22] so in manifests/site.pp have a node level var saying $mariadb_backups_folder='/a/backups' to override the default
[09:34:12] (PS6) Giuseppe Lavagetto: puppet-compiler: module for installation (WiP) [operations/puppet] - https://gerrit.wikimedia.org/r/133449
[09:36:36] (PS1) Springle: pool db1071 and db1071 in s1, warm up [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/133686
[09:37:52] (PS6) Alexandros Kosiaris: Introduce role::mariadb::backup [operations/puppet] - https://gerrit.wikimedia.org/r/132215
[09:38:17] (PS2) Springle: pool db1070 and db1071 in s1, warm up [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/133686
[09:38:50] (CR) Springle: [C: 2] pool db1070 and db1071 in s1, warm up [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/133686 (owner: Springle)
[09:38:59] (Merged) jenkins-bot: pool db1070 and db1071 in s1, warm up [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/133686 (owner: Springle)
[09:40:46] !log springle synchronized wmf-config/db-eqiad.php 'pool db1070 and db1071 in s1, warm up'
[09:40:51] Logged the message, Master
[09:51:04] (PS2) Giuseppe Lavagetto: compare-puppet-catalogs: minor tweaks [operations/software] - https://gerrit.wikimedia.org/r/133505
[09:53:28] (CR) Giuseppe Lavagetto: [C: 2] compare-puppet-catalogs: minor tweaks [operations/software] - https://gerrit.wikimedia.org/r/133505 (owner: Giuseppe Lavagetto)
[09:57:06] (PS1) Springle: reassign db1056 to S4 commonswiki [operations/puppet] - https://gerrit.wikimedia.org/r/133688
[09:59:58] (CR) Springle: [C: 2] reassign db1056 to S4 commonswiki [operations/puppet] - https://gerrit.wikimedia.org/r/133688 (owner: Springle)
[10:02:06] (PS7) Giuseppe Lavagetto: puppet-compiler: module for installation (WiP) [operations/puppet] - https://gerrit.wikimedia.org/r/133449
[10:02:15]
PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 2 below the confidence bounds [10:04:45] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2078: active_shards: 6233: relocating_shards: 0: initializing_shards: 1: unassigned_shards: 0 [10:05:15] rebooted db1056, which was once enwiki master. now only a depooled slave, but still my hands tingle nervously before hitting enter :) [10:05:45] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2079: active_shards: 6234: relocating_shards: 0: initializing_shards: 0: unassigned_shards: 0 [10:09:34] that flappy elasticsearch status is weird [10:09:44] manybubbles: any thoughts? [10:09:58] hmm, 6am your time [10:09:58] heheh [10:13:53] !log springle synchronized wmf-config/db-eqiad.php 'reduce db1049 load while cloning' [10:13:58] Logged the message, Master [10:14:09] !log xtrabackup clone db1049 to db1056 [10:14:14] Logged the message, Master [10:21:29] (03PS1) 10Giuseppe Lavagetto: icinga: Fix anomaly detection checks [operations/puppet] - 10https://gerrit.wikimedia.org/r/133692 [10:22:41] <_joe_> this should fix the anomaly detection for now. 
[10:25:15] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data exceeded the critical threshold [500.0] [10:27:02] <_joe_> this is a bot requesting thumb.php with the "px" in the width [10:36:18] (03PS8) 10Giuseppe Lavagetto: puppet-compiler: module for installation (WiP) [operations/puppet] - 10https://gerrit.wikimedia.org/r/133449 [10:37:15] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [10:42:36] (03PS9) 10Giuseppe Lavagetto: puppet-compiler: module for installation (WiP) [operations/puppet] - 10https://gerrit.wikimedia.org/r/133449 [10:47:15] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [12:05:35] PROBLEM - Puppet freshness on maerlant is CRITICAL: Last successful Puppet run was Fri May 16 06:03:33 2014 [12:09:01] (03CR) 10QChris: "If you want the gerrit->bugzilla bot to add bugzilla" [operations/puppet] - 10https://gerrit.wikimedia.org/r/133589 (https://bugzilla.wikimedia.org/65370) (owner: 10Dzahn) [12:13:05] (03CR) 10QChris: "Forget my above comment. I now see that the integration seems" [operations/puppet] - 10https://gerrit.wikimedia.org/r/133589 (https://bugzilla.wikimedia.org/65370) (owner: 10Dzahn) [12:14:31] (03PS1) 10Ottomata: Slight cleanup of varnish module [operations/puppet/varnish] - 10https://gerrit.wikimedia.org/r/133695 [12:15:55] (03PS7) 10Alexandros Kosiaris: Introduce role::mariadb::backup [operations/puppet] - 10https://gerrit.wikimedia.org/r/132215 [12:17:09] (03PS2) 10Ottomata: Slight cleanup of varnish module [operations/puppet/varnish] - 10https://gerrit.wikimedia.org/r/133695 [12:20:50] (03CR) 10Dan-nl: [C: 04-1] "as far as i understand it, the issue has to do with uploading images > ??mb (maybe 10mb?), which overload the image scalers. 
previous uplo" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132112 (owner: 10Gergő Tisza) [12:21:00] (03PS3) 10Ottomata: Slight cleanup of varnish module [operations/puppet/varnish] - 10https://gerrit.wikimedia.org/r/133695 [12:31:02] (03CR) 10Alexandros Kosiaris: [C: 032] Introduce role::mariadb::backup [operations/puppet] - 10https://gerrit.wikimedia.org/r/132215 (owner: 10Alexandros Kosiaris) [12:38:00] (03PS1) 10Alexandros Kosiaris: Revert "Introduce role::mariadb::backup" [operations/puppet] - 10https://gerrit.wikimedia.org/r/133696 [12:40:13] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] "springle, I totally forgot about the passwords::mysql::dump class that needs to be in private repo before this get merged." [operations/puppet] - 10https://gerrit.wikimedia.org/r/133696 (owner: 10Alexandros Kosiaris) [12:45:28] akosiaris: hmm, so did i [12:45:59] (03CR) 10JanZerebecki: Improve nginx TLS/SSL settings. (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/132393 (https://bugzilla.wikimedia.org/53259) (owner: 10JanZerebecki) [12:47:34] (03CR) 10Filippo Giunchedi: [C: 031] icinga: Fix anomaly detection checks [operations/puppet] - 10https://gerrit.wikimedia.org/r/133692 (owner: 10Giuseppe Lavagetto) [12:49:11] springle: I thought you be asleep or out partying :) [12:57:44] akosiaris: heh. user added to private [12:58:17] nice. I am reverting the revert then :-) [13:08:30] heya _joe_ [13:08:37] i'm trying to use the puppet comparator [13:08:42] but, i'm trying to use it on a submodule change [13:09:21] i can manually change the submodule locally, since I don't think the gui supports comparing change ids from repos other than ops/puppet (does it?) [13:09:22] sorry [13:09:24] not gui [13:09:25] cli [13:10:08] but, can I tell it to to generate catalogs for 2.7 and then compare against the prod one? [13:10:16] or will it always do puppet 3 with no --change arg? 
[13:13:42] (03PS1) 10Springle: raise db1070 and db1071 to normal load [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/133699 [13:14:02] (03PS1) 10Alexandros Kosiaris: Revert "Revert "Introduce role::mariadb::backup"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/133700 [13:14:16] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Revert "Revert "Introduce role::mariadb::backup"" [operations/puppet] - 10https://gerrit.wikimedia.org/r/133700 (owner: 10Alexandros Kosiaris) [13:15:16] (03CR) 10Springle: [C: 032] raise db1070 and db1071 to normal load [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/133699 (owner: 10Springle) [13:15:23] (03Merged) 10jenkins-bot: raise db1070 and db1071 to normal load [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/133699 (owner: 10Springle) [13:16:22] ottomata: you might be able to become my first hope to knowledge). I want to get some simple stats on el.wikipedia.org (like number of editors, articles, edits, etc etc). Where do I go? http://stats.wikimedia.org ? [13:16:58] Special:Statistics? [13:17:47] !log springle synchronized wmf-config/db-eqiad.php 'raise db1070 and db1071 to normal load' [13:17:51] Logged the message, Master [13:18:55] already been there. a lot of stuff is indeed there already, I am missing some though like number of daily edits (do we even have that metric ? ) [13:19:42] stats.wikimedia.org would be your best bet, i think, but don't know too much about what is there either [13:20:12] http://stats.wikimedia.org/EN/SummaryEL.htm [13:20:15] ok thanks. I see some of the stuff I want also here https://stats.wikimedia.org/EN/ChartsWikipediaEL.htm#5 [13:20:22] ah, you beat me to it :P [13:20:37] thanks ! 
[13:34:39] (03PS1) 10Alexandros Kosiaris: bacula: Also encrypt the data channel [operations/puppet] - 10https://gerrit.wikimedia.org/r/133702 [13:46:37] (03PS1) 10Alexandros Kosiaris: Add /a/backups fileset to bacula director [operations/puppet] - 10https://gerrit.wikimedia.org/r/133704 [13:46:52] /a? [13:46:57] this will never die will it [13:48:29] (03PS1) 10Springle: repool db1056 in s4, warm up [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/133705 [13:48:48] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Add /a/backups fileset to bacula director [operations/puppet] - 10https://gerrit.wikimedia.org/r/133704 (owner: 10Alexandros Kosiaris) [13:49:05] <_joe_> oh so /a is the dustbin? [13:49:10] <_joe_> I usually use /opt [13:49:21] you get stabbed for both [13:49:22] I've been trying to migrate to /srv [13:49:33] *I* have been trying to migrate to /srv [13:49:38] can you tell how successful I've been at that? [13:49:48] <_joe_> I was about to say "or /srv" [13:50:03] <_joe_> mark: so /srv over /opt? ok [13:50:17] that /a is my fault, not akosiaris' ;) [13:50:27] no /a far predates you ;) [13:50:39] _joe_: well /opt is a bit different technically [13:51:12] well i've resisted changing /a yet for the dbs. 
so semi my fault [13:51:32] hmm let me add /srv/backups there as well [13:51:42] * matanya doesn't see /a in FHS [13:51:42] <_joe_> mark: yes, when I have some software that does not fit the usual debian way, it goes to /opt, data usually go to /srv [13:51:52] one less excuse :-) [13:52:11] (03CR) 10Springle: [C: 032] repool db1056 in s4, warm up [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/133705 (owner: 10Springle) [13:53:04] just hardlink /a and /srv :) [13:53:18] nobody will ever know [13:53:39] :D [13:54:21] <_joe_> eheh [13:54:35] !log springle synchronized wmf-config/db-eqiad.php 'repool db1056 in s4, warm up' [13:54:40] Logged the message, Master [13:56:46] PROBLEM - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused [13:56:48] _joe_: yeah [13:57:35] PROBLEM - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is CRITICAL: Connection refused [13:57:46] heh [13:57:47] that was my thinking exactly [13:57:47] well, Sean would know at some point [13:58:12] hmm [13:58:45] RECOVERY - LVS HTTPS IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 649 bytes in 0.453 second response time [13:59:20] (03PS1) 10Filippo Giunchedi: add graphite dashboard to gdash [operations/puppet] - 10https://gerrit.wikimedia.org/r/133707 [13:59:35] RECOVERY - LVS HTTP IPv6 on upload-lb.esams.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 627 bytes in 0.193 second response time [14:00:13] (03PS1) 10Alexandros Kosiaris: Also add /srv/backups [operations/puppet] - 10https://gerrit.wikimedia.org/r/133708 [14:01:09] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Also add /srv/backups [operations/puppet] - 10https://gerrit.wikimedia.org/r/133708 (owner: 10Alexandros Kosiaris) [14:01:57] Reedy: https://bugzilla.wikimedia.org/show_bug.cgi?id=65212 new report at https://de.wikipedia.org/wiki/Wikipedia:Technik/Werkstatt#Internal_error_beim_Export_von_Artikeln [14:02:16] can we grab a stacktrace for this fatal 
error at 2014-05-16 11:15:47? [14:04:28] se4598: I think there's a dupe of that with a stack trace [14:06:10] Or maybe not [14:08:19] Reedy: there is a similiar https://bugzilla.wikimedia.org/show_bug.cgi?id=39639 but not the same exception and from 2012 (the code may have changed midtime) [14:08:36] bug updated and duped [14:10:14] (03PS1) 10Alexandros Kosiaris: Adding the predump script [operations/puppet] - 10https://gerrit.wikimedia.org/r/133712 [14:12:13] (03CR) 10Alexandros Kosiaris: [C: 032] Adding the predump script [operations/puppet] - 10https://gerrit.wikimedia.org/r/133712 (owner: 10Alexandros Kosiaris) [14:20:14] (03CR) 10Rush: [C: 031] "I don't have gdash to test this, but considering the innocuous change. I looked through everything and I think this could be a a great hol" [operations/puppet] - 10https://gerrit.wikimedia.org/r/133707 (owner: 10Filippo Giunchedi) [14:22:37] (03PS1) 10Alexandros Kosiaris: Fix a typo [operations/puppet] - 10https://gerrit.wikimedia.org/r/133713 [14:22:58] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Fix a typo [operations/puppet] - 10https://gerrit.wikimedia.org/r/133713 (owner: 10Alexandros Kosiaris) [14:26:07] argh [14:26:19] springle: --socket /tmp/mysql.sock ?? [14:26:52] that is how mariadb runs on dbstore1001 [14:29:56] _joe_: you around? [14:31:19] akosiaris: yep? [14:31:33] oh, you hardcoded it? hmm [14:31:43] Ι didn't [14:31:53] I just assumed it would not change that much [14:31:59] ah [14:32:01] (03PS4) 10Ottomata: Slight cleanup of varnish module [operations/puppet/varnish] - 10https://gerrit.wikimedia.org/r/133695 [14:32:12] anyway, fixable by passing the correct parameter to backup::mysqlset [14:32:18] also since we are on that subject [14:32:25] paravoid, not sure if I did it right, but I just ran the puppet comparator on that change ^^, and it doesn't show any diffs! [14:32:37] why not just not pass socket at all? 
let mysqldump use the system my.cnf [14:32:57] should I call /opt/mariadb-10.0.11-1/bin/mysqldump ? [14:33:06] I don't pass --socket at all [14:33:28] but still it tries to find the /var/run/mysqld/mysql.sock file [14:34:06] ah yes. [mysqld] part has the socket defined [14:34:10] [mysqldump] does not [14:34:22] but [client] does [14:34:25] PROBLEM - graphite.wikimedia.org on tungsten is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 525 bytes in 0.004 second response time [14:34:35] isn't mysql the client ? [14:34:43] aka mysqldump is something else ? [14:34:46] no, that would be [mysql] [14:34:52] [client] is any client [14:34:56] supposedly [14:35:08] hmmm [14:35:28] duh, graphite fell over? taking a look [14:36:57] akosiaris: in mysql-predump.erb, we should probably use --defaults-extra-file [14:37:06] yes I just spotted it [14:37:12] ok will do [14:37:20] cool [14:37:27] this is a deja vu btw... [14:37:31] anyway [14:37:34] that should then work for the multi-instance boxes too [14:37:40] which have custom sockets [14:38:25] RECOVERY - graphite.wikimedia.org on tungsten is OK: HTTP OK: HTTP/1.1 200 OK - 1607 bytes in 0.004 second response time [14:38:48] that being said, the /a/backups directory is being backed up as we speak (hopefully successfully) [14:38:57] nice going [14:39:06] that recovered itself btw, at least I didn't take any action [14:40:08] me neither, by the time I was on the box it was ok again [14:40:19] springle: also /opt/mariadb-10.0.11-1/bin/mysqldump, /usr/bin/mysqldump, /usr/local/bin/mysqldump ? (what is that last one doing in the system anyway) [14:40:57] ah, symbolic link to /opt/blahblah [14:41:00] akosiaris: these are running custom mariadb10 packages. everything except the first is a symlink [14:41:44] eeh not exactly. /usr/bin/mysqldump is a file. shipped by the mysql-client-5.5 package [14:41:54] the other one is indeed a symlink [14:42:11] hrrm [14:43:32] must be a left-over.
dbstore1001 had stock 5.5 once [14:44:02] hmmm so if we remove it percona-xtrabackup goes with it [14:44:10] grr [14:44:21] this is why the stock packages suck [14:44:30] or yet another reason [14:45:18] ok so, I am passing the parameter to the puppet class to /usr/local/bin/mysqldump just to be on the safe side for now [14:45:31] godog: I honestly think expanding reqstats on graphite a few times gave it a heart attack... it looks like there are 9,915 sub dirs under that heading [14:45:40] that may be somewhat excessive :) [14:47:00] le sigh, for all its convenience the graphite tree is unhelpful with many metrics [14:47:55] akosiaris: i suppose, technically /etc/mysql/conf.d/dumps.conf should use [mysqldump] [14:48:14] though i don't know what else would ever auto-load from conf.d/*cnf [14:49:14] oh the first call the mysql client needs it. sorry. my mistake [14:50:46] which means you'd already proved your own question about [client] earlier :) [14:51:54] haha :-) [14:52:03] true true [14:52:59] <_joe_> ottomata: just back [14:53:58] ah, in meeting, done soon [14:57:53] (03PS1) 10Alexandros Kosiaris: bacula/backups: Use defaults-extra-file instead [operations/puppet] - 10https://gerrit.wikimedia.org/r/133718 [14:58:06] (03CR) 10jenkins-bot: [V: 04-1] bacula/backups: Use defaults-extra-file instead [operations/puppet] - 10https://gerrit.wikimedia.org/r/133718 (owner: 10Alexandros Kosiaris) [15:00:22] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] bacula/backups: Use defaults-extra-file instead [operations/puppet] - 10https://gerrit.wikimedia.org/r/133718 (owner: 10Alexandros Kosiaris) [15:01:15] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 7 below the confidence bounds [15:03:34] <_joe_> ok I'm fed up, merging now :P [15:03:56] (03PS2) 10Giuseppe Lavagetto: icinga: Fix anomaly detection checks [operations/puppet] - 10https://gerrit.wikimedia.org/r/133692 [15:04:21] springle: seems like it works
fine now. It will spawn up to 12 mysqldump and use up to 2 cpus for pigzing [15:05:03] will monitor it a bit the next days [15:05:47] akosiaris: ok great [15:06:02] <_joe_> akosiaris, springle: have you ever taken a look at mydumper? [15:06:30] _joe_: yes as well as xtrabackup [15:06:31] akosiaris: thank you for your efforts. i think the cron job might have been simpler, but it's great to be finally using the bacula effort you put in, what, a year ago? [15:06:35] PROBLEM - Puppet freshness on maerlant is CRITICAL: Last successful Puppet run was Fri May 16 06:03:33 2014 [15:06:58] springle: almost... thanks as well for your patience :-) [15:07:10] (03CR) 10Giuseppe Lavagetto: [C: 032] icinga: Fix anomaly detection checks [operations/puppet] - 10https://gerrit.wikimedia.org/r/133692 (owner: 10Giuseppe Lavagetto) [15:07:22] _joe_: yes, mydumper dumping wikis in series was slower than mysqldump dumping in parallel, though it was close [15:07:37] (close enough not to care, actually) [15:07:51] _joe_: I distinctly remember hating the compression implementation in xtrabackup (irrelevant to your question however) [15:08:21] <_joe_> akosiaris: mydumper worked better than mysqldump for me [15:08:46] <_joe_> anyways, if we're not having issues with mysqldump, let's use it :) [15:09:14] :) [15:15:05] *cough* if somebody would merge in mydumper https://bugs.launchpad.net/mydumper/+bug/912432 *cough* [15:16:52] <_joe_> godog: oh yes, that [15:17:47] <_joe_> importance: Undecided → Low [15:17:55] <_joe_> take that!
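The [client] versus [mysqldump] question in the exchange above comes down to which option-file groups a MySQL client program reads: the shared [client] group plus the group named after the program itself, while server groups such as [mysqld] are ignored. A minimal sketch of that rule follows. The file contents, both socket paths as placed here, and the helper name `effective_options` are assumptions made for illustration, not the actual configuration on dbstore1001, and the real parser also honours file order and include directives, which this sketch leaves out.

```python
import configparser

# Hypothetical option file, for illustration only.
SAMPLE_CNF = """
[client]
socket = /tmp/mysql.sock

[mysqld]
socket = /var/run/mysqld/mysql.sock

[mysqldump]
quick
"""


def effective_options(cnf_text, program):
    """Return the options a client program would pick up.

    Merges the shared [client] group with the group named after the
    program; groups for other programs, including [mysqld], are skipped.
    """
    parser = configparser.ConfigParser(allow_no_value=True, interpolation=None)
    parser.read_string(cnf_text)
    merged = {}
    for group in ("client", program):
        if parser.has_section(group):
            merged.update(parser.items(group))
    return merged


if __name__ == "__main__":
    # mysqldump sees the [client] socket even though [mysqldump] sets none.
    print(effective_options(SAMPLE_CNF, "mysqldump"))
```

Under these assumptions a socket defined only under [mysqld] would never reach mysqldump, which matches the behaviour described in the channel; putting it under [client], or passing a file with --defaults-extra-file, makes it visible to every client.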
[15:20:10] (03PS10) 10Giuseppe Lavagetto: puppet-compiler: module for installation [operations/puppet] - 10https://gerrit.wikimedia.org/r/133449 [15:20:12] (03CR) 10jenkins-bot: [V: 04-1] puppet-compiler: module for installation [operations/puppet] - 10https://gerrit.wikimedia.org/r/133449 (owner: 10Giuseppe Lavagetto) [15:20:37] (03PS11) 10Giuseppe Lavagetto: puppet-compiler: module for installation [operations/puppet] - 10https://gerrit.wikimedia.org/r/133449 [15:22:40] godog: heh... but maybe hope; mydumper recently had a release, so it might have a future yet [15:22:55] including, just maybe, merges [15:24:43] true, didn't notice the new release [16:18:55] PROBLEM - HTTP on zirconium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:19:16] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [16:22:52] (03PS3) 10QChris: test gerrit->bz notification [operations/puppet] - 10https://gerrit.wikimedia.org/r/133589 (https://bugzilla.wikimedia.org/65370) (owner: 10Dzahn) [16:23:10] what's up with zirconium? [16:24:10] (03PS2) 10Ori.livneh: add graphite dashboard to gdash [operations/puppet] - 10https://gerrit.wikimedia.org/r/133707 (owner: 10Filippo Giunchedi) [16:26:25] !log updated gerrit's hooks-bugzilla plugin to version 2.8.1.2 to allow talking to bugzilla-4.4.4 [16:26:29] Logged the message, Master [16:28:14] (03CR) 10Ori.livneh: [C: 032] add graphite dashboard to gdash [operations/puppet] - 10https://gerrit.wikimedia.org/r/133707 (owner: 10Filippo Giunchedi) [16:35:33] matanya: could you have another look at hewiki in beta? I've substituted the hebrew analyzer for the icu normalizer. it should be less crashy. [16:35:44] hopefully it is just as good at finding stuff [16:36:06] because we weren't using the hebrew analyzers lematization - just the normalization [16:36:49] I'm having trouble getting http://etherpad.wikimedia.org to load. [16:36:58] Does anyone know who I should ping? 
[16:42:45] RECOVERY - HTTP on zirconium is OK: HTTP OK: HTTP/1.1 302 Found - 518 bytes in 0.046 second response time [16:43:00] im looking into it but seems to be also coming back up [16:43:01] (03PS1) 10QChris: Update hoobs-bugzilla to 5edd392d926daaa58917b1c8bb174cdb022e4c76 [operations/gerrit/plugins] - 10https://gerrit.wikimedia.org/r/133732 (https://bugzilla.wikimedia.org/65370) [16:43:21] (03PS2) 10QChris: Update hooks-bugzilla to 5edd392d926daaa58917b1c8bb174cdb022e4c76 [operations/gerrit/plugins] - 10https://gerrit.wikimedia.org/r/133732 (https://bugzilla.wikimedia.org/65370) [16:43:34] well... now its working. [16:43:41] and i just logged in and did nothing [16:44:16] it was in the middle of a puppet run not 5 minutes ago though [16:44:35] !log i logged into zirconium, but it had recovered by the time I checked it. [16:44:39] Logged the message, RobH [16:44:43] !log partial zirconium downtime [16:44:48] Logged the message, RobH [16:47:59] if anyone has a moment and insight into a host on in the 'unknown status' section here: https://etherpad.wikimedia.org/p/diamond-deployment [16:48:05] don't hesitate to make a note :) [16:50:19] (03CR) 10Chad: [C: 032] Remove old WikiEditor settings [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132793 (owner: 10TheDJ) [16:51:42] (03Merged) 10jenkins-bot: Remove old WikiEditor settings [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/132793 (owner: 10TheDJ) [16:53:12] !log demon synchronized wmf-config/CommonSettings.php 'Removing old WikiEditor settings' [16:53:17] Logged the message, Master [16:53:17] <^d> thedj: Done ^ [16:55:10] !log "in place" reindexing (for cirrus) all the wikipedias after the deploy train hit them yesterday [16:55:15] Logged the message, Master [17:22:46] (03CR) 10CSteipp: [C: 032] "Deploy to beta" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126185 (https://bugzilla.wikimedia.org/59141) (owner: 10CSteipp) [17:24:57] (03Merged) 10jenkins-bot: 
Temporarily allow insecure token trasfer for OAuth [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/126185 (https://bugzilla.wikimedia.org/59141) (owner: 10CSteipp) [17:25:15] (03PS1) 10Chad: GeoData: switch all wikivoyages to using Elastic backend [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/133749 [17:26:00] ou [17:43:25] Nemo_bis: ou? [17:43:42] (03CR) 10Chad: [C: 032] GeoData: switch all wikivoyages to using Elastic backend [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/133749 (owner: 10Chad) [17:45:09] (03PS1) 10Gergő Tisza: Add sampling control setting for MediaViewer event logging [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/133750 [17:45:30] (03Merged) 10jenkins-bot: GeoData: switch all wikivoyages to using Elastic backend [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/133749 (owner: 10Chad) [17:45:40] As in: Oh! Nice. [17:46:19] ^d: do I have perms to do a force push to the phabricator repo's? need to import upstream, I think they are blank slates now [17:47:02] !log demon synchronized wmf-config/CommonSettings.php 'GeoData to Elastic for all wikivoyages' [17:47:05] Logged the message, Master [17:47:52] <^d> chasemp: On it now, hadn't done the acl yet. [17:47:58] ^d: thank you sir [17:50:48] <^d> Done. [17:51:15] by my calculation that means you have at least another hour before my next gerrit question :) [17:51:54] <^d> I'll set my watch :p [18:00:06] hi _joe_, I hear you "updated the anomaly detection graphite alerting scripts" from ori [18:00:24] and we (analytics) has a need for that (maybe?) 
[18:00:48] EventLogging status is tracked in Icinga but we'd like to alert if the volume of EventLogging events goes above a certain threshold [18:02:23] going to push a config change to lower the rate at which multimedia events are logged, greg-g okayed the deploy [18:02:35] (03PS2) 10Ori.livneh: Add sampling control setting for MediaViewer event logging [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/133750 (owner: 10Gergő Tisza) [18:02:38] (03CR) 10Ori.livneh: [C: 032] Add sampling control setting for MediaViewer event logging [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/133750 (owner: 10Gergő Tisza) [18:02:49] (03Merged) 10jenkins-bot: Add sampling control setting for MediaViewer event logging [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/133750 (owner: 10Gergő Tisza) [18:04:15] !log ori updated /a/common to {{Gerrit|Ia43821231}}: Add sampling control setting for MediaViewer event logging [18:04:19] Logged the message, Master [18:07:21] !log ori synchronized wmf-config 'Ia43821231: Add sampling control setting for MediaViewer event' [18:07:26] Logged the message, Master [18:07:35] PROBLEM - Puppet freshness on maerlant is CRITICAL: Last successful Puppet run was Fri May 16 06:03:33 2014 [18:08:07] (03PS1) 10Gergő Tisza: Tweak MediaViewer sampling settings [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/133756 [18:08:14] (03CR) 10jenkins-bot: [V: 04-1] Tweak MediaViewer sampling settings [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/133756 (owner: 10Gergő Tisza) [18:08:46] (03PS2) 10Ori.livneh: Tweak MediaViewer sampling settings [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/133756 (owner: 10Gergő Tisza) [18:08:51] (03CR) 10Ori.livneh: [C: 032] Tweak MediaViewer sampling settings [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/133756 (owner: 10Gergő Tisza) [18:08:59] (03Merged) 10jenkins-bot: Tweak MediaViewer sampling settings 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/133756 (owner: 10Gergő Tisza) [18:12:44] !log ori synchronized wmf-config/InitialiseSettings.php 'I3c453b0949f4e: Tweak MediaViewer sampling settings' [18:12:47] Logged the message, Master [18:13:24] (03PS1) 10Chad: GeoData: Switch all wikis to using Elastic as backend [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/133759 [18:13:51] (03CR) 10Chad: [C: 04-2] "Don't merge yet, just prepping for next week." [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/133759 (owner: 10Chad) [18:16:14] milimetric: I think _joe_ is off for the day, drop a pm for monday? [18:16:25] thanks chasemp, will do [18:26:45] PROBLEM - ElasticSearch health check on elastic1015 is CRITICAL: CRITICAL - elasticsearch (production-search-eqiad) is running. status: red: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2093: active_shards: 6278: relocating_shards: 2: initializing_shards: 1: unassigned_shards: 0 [18:27:45] RECOVERY - ElasticSearch health check on elastic1015 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 16: number_of_data_nodes: 16: active_primary_shards: 2094: active_shards: 6279: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0 [18:28:11] (03PS1) 10Dzahn: create account for gtisza (tgr) [operations/puppet] - 10https://gerrit.wikimedia.org/r/133761 [18:31:25] (03PS2) 10Dzahn: create account for gtisza (tgr) [operations/puppet] - 10https://gerrit.wikimedia.org/r/133761 [18:37:17] RobH, sorry I took off after reporting etherpad issues. I just wanted to pop back in to say thanks for looking into it. [18:37:27] WFM now. [18:46:38] halfak: no need to apologize, and it fixed itself! 
[18:46:55] but yea, there is now a followup task to see why, and to install monitoring paging for services like blog and etherpad [18:47:07] so its sparked off some positive followup [18:47:13] * halfak nods  [18:47:17] Good deal. :) [18:59:36] stupid service check [19:17:37] (03PS1) 10Manybubbles: Update highlighter to 0.0.9 [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/133771 [19:20:57] (03CR) 10Manybubbles: "Not sure when I want to deploy this given I'll be on a place for a bunch of hours next week and I'll be at a conference for the beginning " [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/133771 (owner: 10Manybubbles) [19:52:27] (03CR) 10Gergő Tisza: [C: 031] create account for gtisza (tgr) [operations/puppet] - 10https://gerrit.wikimedia.org/r/133761 (owner: 10Dzahn) [20:16:00] (03Abandoned) 10Hashar: jobrunner: reduce polling on beta cluster [operations/puppet] - 10https://gerrit.wikimedia.org/r/123444 (owner: 10Hashar) [20:37:38] bd808: Hey, can you help me out with a deployment thing in beta labs? [20:37:58] RoanKattouw: I can try. What's up? [20:38:07] I'm lazy and don't want to spend hours setting up TimedMediaHandler and scalers and crap so I wanted to deploy a proposed fix to beta to test it there [20:38:43] But sync-file doesn't seem to work right [20:38:55] Ah. Probably not. [20:38:58] And the copy of the file I was working on seems to have been destroyed while I was at lunch [20:39:31] Let me look at the prep job in Jenkins. It may do a hard reset every 10 minutes [20:39:40] (03CR) 10Mattflaschen: [C: 031] "Good to go, will merge during deploy." 
[operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/133442 (owner: 10Phuedx) [20:39:53] That sounds reasonable [20:39:54] In any case [20:40:09] On deployment-bastion, I have a file /home/catrope/TextHandler.hpp [20:40:10] *php [20:40:23] I need that file to appear at extensions/TimedMediaHandler/handlers/TextHandler/TextHandler.php [20:40:30] in the code running on beta labs [20:40:33] RoanKattouw: there is a vagrant role for TMH [20:40:50] I don't care how you do it or tell me to do it or which gods I have to make human sacrifices to, I just want that code to run [20:40:53] should set up everything in theory [20:42:10] (and by "in theory" I mean "I am probably the only person who ever used it" but it would be nice if more people gave it a try) [20:42:32] Ok, the update job (beta-code-update-eqiad) seems to just `git pull`, but it follows that up with `git submodule update --init --recursive` so we need to disable the update job to test hot patches. [20:42:56] Which is easy enough to do if it won't take long to test. [20:43:45] RoanKattouw: Do you want to try hacking beta or give MW-Vagrant a shot first? [20:44:30] bd808: Let's hack beta real quick [20:44:34] I only need it to be there for like 5 mins [20:45:08] Works for me. I'll beta-swat it for you :) [20:45:16] Awesome :) [20:45:33] Ping me when it's up. All I need to do is make one HTTP request and I'll know [20:49:49] RoanKattouw: It's ready [20:50:33] bd808: And it WORKS [20:50:34] Sweet! [20:50:36] Thanks man [20:50:40] I'll go put that patch in Gerrit pronto [20:50:48] In the meantime feel free to discard that change [20:52:07] Nice. For future reference I did 1) disable beta-code-update-eqiad in Jenkins gui. 2) Change /srv/scap-stage-dir/php-master on deployment-bastion as appropriate. 
3) /usr/local/bin/wmf-beta-scap [20:52:44] (03PS1) 10Ori.livneh: Remove cleanupipc cronjob [operations/puppet] - 10https://gerrit.wikimedia.org/r/133831 [20:53:25] Aha [20:53:39] I would have been happy to skip #1 (only needed it briefly) but I didn't know #3 [20:53:46] Now I know :) [20:54:00] RoanKattouw, I think before when I've wanted to check stuff like this I've just patched it on both apaches, heh. [20:54:09] That wrapper script is the real magic [20:55:02] sync-* should work there too if you manually setup an ssh-agent like /usr/local/bin/wmf-beta-scap does [20:55:15] But a full scap takes like 45 seconds :) [20:56:07] Nice [20:56:12] That's better than 45 minutes [20:56:19] Much [20:56:46] It takes longer if there is an l10n update but usually not more than 8 or 9 minutes [20:57:38] (stupid l10nupdate) [20:57:52] Stupid large binary cache format [21:00:38] (03CR) 10Ori.livneh: "Aaron figured it out -- it's https://lists.debian.org/debian-isp/2002/04/msg00041.html" [operations/puppet] - 10https://gerrit.wikimedia.org/r/133831 (owner: 10Ori.livneh) [21:01:53] (03PS1) 10Manybubbles: Install kuromoji [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/133833 [21:02:47] that's impressive, quoting a debian-isp message from 2002. 
[21:02:56] referencing* [21:05:28] heh [21:06:03] <_joe_> ori: that's the classical thing from the past no-one ever removed because everyone forgot its purpose :) [21:06:42] yeah, we have lots of those [21:06:54] i hope to clear up some of the cobwebs from the mediawiki setup specifically [21:07:44] for example, mediawiki runs under user 'apache', even though apache's user in debian/ubuntu is www-data -- my best guess for why that's the case is that it was a carryover from the fedora core days [21:08:35] PROBLEM - Puppet freshness on maerlant is CRITICAL: Last successful Puppet run was Fri May 16 06:03:33 2014 [21:19:41] AaronSchulz: ^^ pinging you since one of gadgets not working is Twinkle [21:19:45] er, wrong channel [21:24:17] (03CR) 10Manybubbles: [C: 04-1] "Just for beta right now" [operations/software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/133833 (owner: 10Manybubbles) [21:33:38] !log ori synchronized php-1.24wmf5/extensions/MultimediaViewer 'Update MultimediaViewer for I0df067a61: Add sampling to unsampled event logging' [21:33:42] Logged the message, Master [21:37:05] !log ori synchronized php-1.24wmf4/extensions/MultimediaViewer 'Update MultimediaViewer for I0df067a61: Add sampling to unsampled event logging' [21:37:09] Logged the message, Master [21:41:33] (03PS1) 10Manybubbles: Cirrus as default for zh_yuewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/133840 [21:46:36] (03CR) 10Chad: [C: 031] Cirrus as default for zh_yuewiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/133840 (owner: 10Manybubbles) [22:31:35] PROBLEM - Puppet freshness on db1007 is CRITICAL: Last successful Puppet run was Fri May 16 19:30:32 2014 [22:34:16] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 21.43% of data exceeded the critical threshold [500.0] [22:37:16] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 4 below the confidence bounds 
[22:38:29] this thing still on? [22:38:40] so i just got paged [22:38:49] whats up with tungsten? [22:38:50] wmf irc splits from like, everyone, and we got a huge 5xx spike around the same time [22:40:33] So bblack you know much about graphite? [22:40:38] nope :) [22:40:50] hrmm, as the monitoring host i tend to not wanna poke too much and break monitoring logs [22:41:06] I know those two alerts have been going crazy for days and people have been working on them and the alerts usually aren't "real" [22:41:18] but, those look a bit real [22:41:20] urgh, an ops list email about it would be nice...... [22:41:29] unless there was one and i didnt read [22:41:32] * RobH checks [22:42:11] hrmm, nope, but i can see gerrit changes pointing more things at it [22:42:30] the 5xx spike in graphite subsided fairly quickly on its own [22:42:44] that + irc split around the same time makes me think network hiccup [22:42:47] https://graphite.wikimedia.org/render/?title=HTTP%205xx%20Responses%20-8hours&from=-8hours&width=1024&height=500&until=now&areaMode=none&hideLegend=false&lineWidth=2&lineMode=connected&target=color%28cactiStyle%28alias%28reqstats.5xx,%225xx%20resp/min%22%29%29,%22blue%22%29 [22:45:34] heh, seems a lot of folks are using dickson? 
[22:45:36] dunno [22:46:33] I like dickson :) [22:47:08] bblack: i like that we are using the graphite host to look at graphite graphs of the graphite host [22:47:16] it seems right [22:47:35] (not really, but meh ;) [22:49:55] PROBLEM - Disk space on labsdb1002 is CRITICAL: DISK CRITICAL - free space: /a 124772 MB (3% inode=99%): [22:51:33] greg-g: The Vector skin issues you mentioned me on in #wikimedia-tech and #wikimedia-qa are probably the result of one of two changes: 1) removal of collapsibleNav; if something depended on that visually, it's probably just user scripts that did something hacky and questionable in the first place, they'll figure out another way to do what they did which is how they did it in the first place, 2) related to css refactoring in mw-body-content, afaik that should not have any effect on existing code [22:52:03] :/ [22:52:38] Neither were my changes. I could look into it tomorrow. I'm out of the country at the moment, settling in at a family residence for the weekend [22:52:55] If you need me to look into it, file a bug assigned to me and repeat those links, irc not buffered at the moment. [22:53:03] I'll get to it first thing tomorrow. [22:53:06] kk [22:53:14] ty [22:53:50] Krinkle: nah [22:53:53] MatmaRex: ^ I'll assume you're on point for now, I have to leave in about 15, but if you can't figure it out, could you report the bug and assign to Krinkle [22:54:06] Nemo_bis did the collapsible nav thing, but it got merged ages ago [22:54:09] Krinkle: ext.gadget.* modules are not loading (sometimes), load.php reports them as "Problematic modules" [22:54:47] !log aaron synchronized php-1.24wmf5/includes/db/Database.php '8829ffc72d3332d348a1a2e58d525e54e126bad5' [22:55:00] Logged the message, Master [22:55:13] Krinkle: e.g. 
https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=ext.gadget.BugStatusUpdate&only=styles&skin=vector&whatever yields "Problematic modules: {"ext.gadget.BugStatusUpdate":"missing"}" [22:55:35] (keep changing the 'whatever' part if you don't see it) [22:55:37] https://gerrit.wikimedia.org/r/#/c/131259/ was that bit [22:56:27] !log aaron synchronized php-1.24wmf4/includes/db/Database.php '182e42c173b9ab0c2bc5d753879a000b1ff39e77' [22:56:31] Logged the message, Master [22:56:40] https://gerrit.wikimedia.org/r/#/c/131402/ was the other one [22:56:44] (note, that gadget seems to have no styles, so an empty page at that URL is the correct behavior) [22:57:15] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [22:58:04] my wild guess is that Gadget::loadStructuredList() is not returning the correct results, but i can't debug that [22:58:55] RECOVERY - Disk space on labsdb1002 is OK: DISK OK [22:59:16] MatmaRex: Krinkle can't help right now, could you summarize what you know/think in a bug report, please? I'd be much apliged. [22:59:41] i just summarized all i know. tl;dr gadget modules are reported as 'missing' in load.php [23:00:28] a real-life URL that exhibits the issue is e.g. 
[23:00:38] https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=DRN-wizard%2CReferenceTooltips%2CWatchlistChangesBold%2Ccharinsert%2Cedittop%2CmySandbox%2CrefToolbar%2Csearch-new-tab%2Cteahouse%7Cext.geshi.language.css%2Chtml4strict%2Cjavascript%2Ctext%7Cext.geshi.local%7Cext.uls.nojs%7Cext.visualEditor.viewPageTarget.noscript%7Cext.wikihiero%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmediawiki.skinning.interface%7Cmediawiki.ui.button%7Cmw.PopUpMediaTransform%7Cskins.vector.styles%7Cwikibase.client.init&only=styles&skin=vector&* [23:03:15] I might add that I cannot reproduce it on Commons [23:04:06] nor on Meta [23:05:05] Popups does work for me, but without any CSS [23:05:13] the JavaScript part appears to be OK [23:06:04] I can try to summarize later tonight, but I make no promise on how quickly. [23:06:16] MatmaRex: Problematic modules [23:06:16] MatmaRex: Interesting, that can't be a new issue [23:06:22] MatmaRex: The only thing new in recent wmf branches is that it actually says they're problematic [23:06:25] previously they'd be omitted. [23:06:31] so the actual issue must be something else [23:06:33] which may or may not be new [23:06:34] i know [23:06:36] yet it's happening [23:06:49] and the gadgets-not-loading *issue* is new [23:07:05] I need more context, what's this about? [23:07:14] I only saw "Vector sans-serif" [23:07:17] Krinkle: gadgets do not load on en.wp, sometimes [23:07:28] or load partially [23:07:38] deterministic? [23:07:42] everything else works, or mostly works (as usually) [23:07:45] Krinkle: nope [23:07:59] happens in about 20% of tries for me [23:08:15] steps to reproduce it? 
[23:08:24] Krinkle: look at https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=ext.gadget.BugStatusUpdate&only=styles&skin=vector&whatever [23:08:29] change the 'whatever' part until you get the error [23:08:35] (empty page = correct behavior in this case) [23:08:45] interesting [23:08:48] No [23:08:50] took me 4 tries right now [23:08:59] empty page means the server serving it is using an old version of mediawiki core [23:09:07] non-empty page with just that comment is the same exact error [23:09:13] but with the new code that makes it say that [23:09:18] um? [23:09:30] empty page certainly isn't your expected behaviour? [23:09:41] it is, that module has no styles [23:09:52] ? [23:09:55] here's a real-life URL if you prefer [23:09:56] https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=DRN-wizard%2CReferenceTooltips%2CWatchlistChangesBold%2Ccharinsert%2Cedittop%2CmySandbox%2CrefToolbar%2Csearch-new-tab%2Cteahouse%7Cext.geshi.language.css%2Chtml4strict%2Cjavascript%2Ctext%7Cext.geshi.local%7Cext.uls.nojs%7Cext.visualEditor.viewPageTarget.noscript%7Cext.wikihiero%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmediawiki.skinning.interface%7Cmediawiki.ui.button%7Cmw.PopUpMediaTransform%7Cskins.vector.styles%7Cwikibase.client.init&only=styles&skin=vector&* [23:10:37] uh, or not [23:10:39] OK, so what is the problem? Gadget is making http requests for styles always, and because they don't contain styles, it skips them. 
Previously that meant only=styles for 5 gadgets returning 3 stylesheets, and now 3 stylesheets plus a comment listing the two it skipped [23:10:47] sorry [23:10:56] https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=ext.echo.badge%7Cext.gadget.BugStatusUpdate%2CDRN-wizard%2CReferenceTooltips%2CWatchlistChangesBold%2Ccharinsert%2Cedittop%2CmySandbox%2CrefToolbar%2Csearch-new-tab%2Cteahouse%7Cext.geshi.language.css%2Chtml4strict%2Cjavascript%2Ctext%7Cext.geshi.local%7Cext.uls.nojs%7Cext.visualEditor.viewPageTarget.noscript%7Cext.wikihiero%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmediawiki.skinning.interface%7Cmediawiki.ui.button%7Cmw.PopUpMediaTransform%7Cskins.vector.styles%7Cwikibase.client.init&only=styles&skin=vector&* [23:10:57] and if you load it in isolation like the url you gave me, it says it's problematic (i.e. requested but not found) [23:11:05] the only=scripts equiv seems to be working [23:11:10] I'm missing something [23:11:11] zoom out [23:11:15] what is the user facing bug? [23:11:54] Gadgets do not load or load partially. [23:11:54] Krinkle: gadgets don't load. [23:12:13] MatmaRex: The comment being there or the page being empty is not related afaics. [23:12:18] on the English Wikipedia, I hasten to add [23:12:34] What made you connect that behaviour? 
[23:12:50] Krinkle: http://i.imgur.com/fuTBAvo.png [23:13:08] ext.gadget.ReferenceTooltips definitely has styles [23:13:12] Right [23:13:19] https://bits.wikimedia.org/en.wikipedia.org/load.php?debug=false&lang=en&modules=ext.echo.badge%7Cext.gadget.BugStatusUpdate%2CDRN-wizard%2CReferenceTooltips%2CWatchlistChangesBold%2Ccharinsert%2Cedittop%2CmySandbox%2CrefToolbar%2Csearch-new-tab%2Cteahouse%7Cext.geshi.language.css%2Chtml4strict%2Cjavascript%2Ctext%7Cext.geshi.local%7Cext.uls.nojs%7Cext.visualEditor.viewPageTarget.noscript%7Cext.wikihiero%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmediawiki.skinning.interface%7Cmediawiki.ui.button%7Cmw.PopUpMediaTransform%7Cskins.vector.styles%7Cwikibase.client.init&only=styles&skin=vector&*qbra [23:13:24] that's this URL [23:13:28] k, never mind. [23:13:39] "Problematic modules" is not output if the module exists but only has scripts [23:13:43] so that was a good example of yours [23:13:54] as i said, happens non-deterministically, so i added some junk at the end until it showed the error [23:13:59] yeah [23:14:13] MatmaRex: Were you able to tie it to a server? [23:14:31] if you do the invalidation buster query, look at Network in the meantime in Chrome and look at the served by header [23:14:35] doing that myself now as well [23:14:36] didn't try. does load.php let me tie the output to a server? [23:14:41] Should yeah [23:14:46] the one that served it to varnish [23:14:51] I'll check debug logs meanwhile [23:16:55] mw1104, mw1105, mw1126, mw1057 [23:18:02] mw1113, mw1087, mw1047 [23:18:06] so all kinds [23:19:28] twkozlowski: those aren't bits servers [23:19:38] the load.php request, not the /wiki request [23:20:03] oh, sorry, right [23:20:27] These are bits servers https://github.com/wikimedia/operations-puppet/blob/production/files/dsh/group/bits [23:20:33] e.g. 
cp3019 [23:21:22] Krinkle: anything related to servers in the headers is "X-Cache: cp1070 miss (0), cp3020 miss (0)" [23:21:31] (that's on a problematic URL) [23:21:38] no "served by" or anything [23:22:03] RoanKattouw: Hm.. cp3019 is a varnish. Are any of the servers in the dsh bits group application servers? [23:22:16] If not, which are, and are they exposed in request headerS? [23:22:27] X-Cache:cp1056 miss (0), cp3019 miss (0) [23:22:34] Maybe the 10xx is an app server [23:22:53] The bits group is the bits Varnishes I think [23:22:57] cp = caching proxy [23:23:00] right [23:23:02] 10xx = eqiad, 30xx = esams [23:23:28] https://github.com/wikimedia/operations-puppet/blob/a15c8062d2ff/manifests/role/cache.pp#L76-L79 [23:23:29] The app server is only shown in the comment in the HTML output [23:23:31] Oh [23:23:45] Not sure if we ever got around to including that for load.php, or set a header, or what [23:23:46] yeah, 10xx eqiad, 30xx esams [23:24:10] 'Bits application servers eqiad' => 'mw1151.eqiad.wmnet mw1152.eqiad.wmnet', [23:24:17] only 2? 
[23:24:25] No those are the aggregators [23:24:28] See Ganglia for the full list [23:25:13] mw1149, mw1150, mw1151, mw1152 [23:25:29] yeah 4, that's what I remember [23:25:36] $ cat resourceloader.log | grep -v 'New definition hash' | grep -v 'request for private module' [23:25:37] zaroo results [23:25:43] that's good, in a way [23:25:46] but also not so useful [23:30:05] RECOVERY - Puppet freshness on db1007 is OK: puppet ran at Fri May 16 23:30:00 UTC 2014 [23:32:00] Krinkle: unless you have better ideas, i'd check if Gadget::loadStructuredList is sometimes returning null when it shouldn't be, and if yes, why is it doing that [23:32:59] MatmaRex: yeah, I'm mw-evalling now [23:32:59] i think i'm off to get some sleep, good luck debugging [23:33:07] ugh, I said that an hour ago [23:33:12] :) [23:33:16] night night :) [23:33:18] (oh, then i'll hang on for a while to see what happens) [23:33:20] hah [23:33:32] A few minutes would be nice :) [23:35:36] Gadget::loadStructuredList() and the underlying memcached object is fine [23:35:46] at least not critical, let me inspect it [23:36:24] MatmaRex: Hm.. would you say it's all or none for gadgets? [23:36:33] or have you gotten a request with some gadgets [23:36:37] (a single load.php request) [23:37:59] Hi, has anyone here been trained on Fundraising cluster boxen? [23:38:35] Krinkle: seems to be all-or-nothing, but i haven't really looked at this [23:39:19] i just reproduced twice and no gadget modules were loaded in both cases [23:39:38] (i failed to reproduce a few times and everything was loaded in these cases) [23:40:05] (i'm comparing the list given in the URL to the list of problematic ones) [23:41:21] I think I'm going to live patch [23:41:54] i love the smell of live patches on friday night! [23:41:58] MatmaRex: Can you file a bug? [23:42:08] Just a few links and what you found so far. [23:42:38] i bet someone filed at least one already [23:42:49] there it is. 
https://bugzilla.wikimedia.org/show_bug.cgi?id=65424 for example [23:43:55] (seems to be the only one) [23:44:56] Krinkle: it's too late here for coherent thought ;) i can clean up and copy-paste the logs from here [23:47:18] RoanKattouw: Can you sanity check and/or tell me who I should ask? [23:47:34] https://gerrit.wikimedia.org/r/#/c/133871/ [23:47:42] I'd like to deploy that one-line debugger (adds wfHostname() to load.php error responses) [23:48:01] haven't been in this position since the svn days [23:48:10] rl always works :) [23:55:57] (CR) Gergő Tisza: "Do they necessarily know how expensive it will be to thumbnail the files, though? (File sizes are part of that, but not the only factor; f" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/132112 (owner: Gergő Tisza) [23:56:47] ori: Unapplied submodule update in 1.24wmf4 for EventLogging [23:58:12] Krinkle: d'oh, I'll sync [23:58:44] ori: Can you sanity check https://gerrit.wikimedia.org/r/#/c/133871/ before I sync that in a few?
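The reproduction steps discussed above (append varying junk to the load.php URL to dodge the cache, then check whether the response carries a "Problematic modules" comment) could be scripted roughly as follows. This is only a sketch under assumptions, not something anyone in the channel ran: the function names `build_url` and `problematic_modules` are hypothetical, the exact comment wrapper around the JSON is not assumed, and fetching is left out so the snippet has no network dependency.

```python
import json
import re
from urllib.parse import urlencode

# Endpoint taken from the URLs pasted in the log.
LOAD_PHP = "https://bits.wikimedia.org/en.wikipedia.org/load.php"

# Matches the JSON object that follows "Problematic modules:" in a response body.
# Assumes a flat object such as {"ext.gadget.BugStatusUpdate":"missing"}.
_PROBLEMATIC = re.compile(r"Problematic modules:\s*(\{.*?\})")


def build_url(modules, cache_buster):
    """Build an only=styles load.php URL with a junk suffix to bypass the cache."""
    query = urlencode({
        "debug": "false",
        "lang": "en",
        "modules": "|".join(modules),  # urlencode turns "|" into %7C
        "only": "styles",
        "skin": "vector",
    })
    return f"{LOAD_PHP}?{query}&{cache_buster}"


def problematic_modules(body):
    """Return the module -> state mapping reported in a response body, or {}."""
    match = _PROBLEMATIC.search(body)
    if match is None:
        return {}
    return json.loads(match.group(1))


if __name__ == "__main__":
    # Print a handful of candidate URLs; fetch them with any HTTP client and
    # pass each response body to problematic_modules().
    for attempt in range(5):
        print(build_url(["ext.gadget.BugStatusUpdate"], f"try{attempt}"))
```

An empty result from `problematic_modules` is ambiguous on its own, as the discussion points out: it can mean the module loaded, the module has no styles, or the serving host runs older code that omits the comment.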