[00:07:44] AaronS: ok, rbf1001-1002 online =]
[00:15:32] (CR) coren: [C: 2] webnode.pp: Install LUA support for lighttpd (lighttpd-mod-magnet) [operations/puppet] - https://gerrit.wikimedia.org/r/150712 (https://bugzilla.wikimedia.org/68614) (owner: Hedonil)
[00:15:49] (CR) coren: [V: 2] "Why missing +2V?" [operations/puppet] - https://gerrit.wikimedia.org/r/150712 (https://bugzilla.wikimedia.org/68614) (owner: Hedonil)
[00:18:09] (PS1) Bsitu: Enable job queue to process notification on all wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/150742
[00:32:14] springle, yt? I noticed that every index on tag_summary is duplicated - it it intended?
[00:34:30] *in production, not tables.sql :)
[00:36:36] old and new?
[00:36:40] one just not deleted?
[00:39:50] MaxSem: on enwiki, yeah i'd noticed. needs to be fixed on a slave-by-slave basis when depooled, since no PK for online change
[01:32:53] PROBLEM - Puppet freshness on labsdb1004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 00:00:00 UTC
[02:06:28] !log labsdb1001 migrating to mariadb 10, expect read-only and downtime, see labs-l
[02:06:36] Logged the message, Master
[02:20:21] !log LocalisationUpdate completed (1.24wmf14) at 2014-07-31 02:19:17+00:00
[02:20:27] Logged the message, Master
[02:21:27] RECOVERY - mysqld processes on labsdb1001 is OK: PROCS OK: 1 process with command name mysqld
[02:36:33] !log LocalisationUpdate completed (1.24wmf15) at 2014-07-31 02:35:29+00:00
[02:36:38] Logged the message, Master
[02:39:20] PROBLEM - mysqld processes on labsdb1001 is CRITICAL: PROCS CRITICAL: 2 processes with command name mysqld
[02:54:08] (PS1) Withoutaname: Remove deprecated $wgHTCPMulticastAddress and replace with $wgHTCPRouting [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/150759
[03:18:25] PROBLEM - Disk space on vanadium is CRITICAL: DISK CRITICAL - free space: / 4273 MB (3% inode=94%):
[03:19:35] !log LocalisationUpdate
ResourceLoader cache refresh completed at Thu Jul 31 03:18:07 UTC 2014 (duration 18m 6s)
[03:19:40] Logged the message, Master
[03:34:00] PROBLEM - Puppet freshness on labsdb1004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 00:00:00 UTC
[04:11:58] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 19517 MB (3% inode=99%):
[04:21:53] PROBLEM - Puppet freshness on labsdb1001 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 02:21:25 UTC
[04:28:51] wmflabs services that access the database seem to be experiencing MySQL errors.
[04:29:44] https://tools.wmflabs.org/sigma/editorinteract.py and other tools give errors like "Access denied for user 's51469'@'10.68.17.123' (using password: YES)"
[04:29:59] James_F|Away / andrewbogott_afk ping
[04:30:29] LFaraone: [19:06:28] !log labsdb1001 migrating to mariadb 10, expect read-only and downtime, see labs-l
[04:30:42] yes, these are... read operations, no?
[04:31:05] yes, he also said "downtime" though :P
[04:31:57] that was... several hours ago, though, right?
[04:33:47] About 2 and a half yeah, but I guess the migration will take some time
[04:47:44] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 19552 MB (3% inode=99%):
[04:52:50] (PS1) Springle: Remove mysql_multi_instance from labsdb1001 after migration. [operations/puppet] - https://gerrit.wikimedia.org/r/150764
[04:55:47] (CR) Springle: [C: 2] Remove mysql_multi_instance from labsdb1001 after migration. [operations/puppet] - https://gerrit.wikimedia.org/r/150764 (owner: Springle)
[05:02:44] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 19877 MB (3% inode=99%):
[05:05:23] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[05:08:30] RECOVERY - Puppet freshness on labsdb1001 is OK: puppet ran at Thu Jul 31 05:08:20 UTC 2014
[05:18:20] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 20164 MB (3% inode=99%):
[05:33:21] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 20049 MB (3% inode=99%):
[05:34:09] PROBLEM - Puppet freshness on labsdb1004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 00:00:00 UTC
[05:54:09] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 03:53:09 UTC
[06:29:39] RECOVERY - Disk space on vanadium is OK: DISK OK
[06:33:12] (PS6) Ori.livneh: Add HHVM module [operations/puppet] - https://gerrit.wikimedia.org/r/150506
[06:33:29] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Thu Jul 31 06:33:27 UTC 2014
[06:42:07] (PS7) Ori.livneh: Add HHVM module [operations/puppet] - https://gerrit.wikimedia.org/r/150506
[06:42:11] PROBLEM - puppet last run on virt0 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:50:11] !log labsdb1001 migration complete, should be all systems go
[06:50:17] Logged the message, Master
[06:50:25] LFaraone: ^
[06:56:00] PROBLEM - puppet last run on ssl1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:58:09] Cool
[06:59:19] RECOVERY - puppet last run on virt0 is OK: OK: Puppet is currently enabled, last run 1 seconds ago with 0 failures
[07:11:55] (CR) Giuseppe Lavagetto: [C: -1] "LGTM apart from a couple of comments - but you removed mediawiki::hhvm and I don't see references to it changed in the manifests."
(2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/150506 (owner: Ori.livneh)
[07:13:42] (CR) Ori.livneh: Add HHVM module (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/150506 (owner: Ori.livneh)
[07:14:09] RECOVERY - puppet last run on ssl1001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures
[07:16:04] (CR) Giuseppe Lavagetto: [C: -1] "IMO, this is not something you can solve via puppet this way. The file is written by puppet itself and each time it gets rotated it will b" [operations/puppet] - https://gerrit.wikimedia.org/r/150273 (owner: Dzahn)
[07:19:36] (PS8) Ori.livneh: Add HHVM module [operations/puppet] - https://gerrit.wikimedia.org/r/150506
[07:20:53] (CR) Ori.livneh: "PS8 implements Giuseppe's suggestion and restores modules/mediawiki/manifests/hhvm.pp for now (I'll port it to use this module in a separa" [operations/puppet] - https://gerrit.wikimedia.org/r/150506 (owner: Ori.livneh)
[07:22:46] (PS4) Ori.livneh: Nutcracker: move declaration to role::mediawiki; parametrize [operations/puppet] - https://gerrit.wikimedia.org/r/149800
[07:35:09] PROBLEM - Puppet freshness on labsdb1004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 00:00:00 UTC
[07:43:08] (CR) Giuseppe Lavagetto: [C: 1] Add HHVM module [operations/puppet] - https://gerrit.wikimedia.org/r/150506 (owner: Ori.livneh)
[07:51:22] (PS3) Giuseppe Lavagetto: nginx - remove cipher kEDH+AESGCM [operations/puppet] - https://gerrit.wikimedia.org/r/146806 (owner: Dzahn)
[07:51:48] (CR) Giuseppe Lavagetto: [C: 2] nginx - remove cipher kEDH+AESGCM [operations/puppet] - https://gerrit.wikimedia.org/r/146806 (owner: Dzahn)
[07:56:50] PROBLEM - puppetmaster backend https on palladium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:57:39] RECOVERY - puppetmaster backend https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in
0.053 second response time
[08:03:09] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet).
[08:07:36] (CR) Alexandros Kosiaris: [C: -2] "This is an old version of the README.Debian file. You probably are still on the debian branch which does not exist anymore as you pointed " [operations/debs/kafka] (debian) - https://gerrit.wikimedia.org/r/150713 (owner: Ottomata)
[08:14:32] <_joe_> sorry I forgot to merge that
[08:15:09] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[08:15:21] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[08:31:39] RECOVERY - Disk space on elastic1016 is OK: DISK OK
[08:43:39] <_joe_> !log start rolling reload of nginx to catch up with the new ssl config
[08:43:45] Logged the message, Master
[09:04:07] (CR) Filippo Giunchedi: "the right solution would be to set the shell's umask in the cron invocation, but that might have other undesirable side effects if puppet " [operations/puppet] - https://gerrit.wikimedia.org/r/150273 (owner: Dzahn)
[09:04:30] <_joe_> godog: that was my thinking as well
[09:04:50] (CR) Filippo Giunchedi: restrict access to puppet logs to root users (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/150273 (owner: Dzahn)
[09:05:00] <_joe_> godog: and I don't think that would affect puppet in general, the "file" resource is managed explicitly
[09:05:14] <_joe_> but it needs testing
[09:07:41] true it might just behave, not sure it is worth the risk and I think it was discussed to just go ahead with the stopgap and move eventually to syslog
[09:31:29] PROBLEM - puppetmaster backend https on strontium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:32:19] RECOVERY - puppetmaster backend https on strontium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.023 second response time
[09:32:46] <_joe_> no errors on strontium, just checked...
[09:36:09] PROBLEM - Puppet freshness on labsdb1004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 00:00:00 UTC
[09:45:24] (CR) Filippo Giunchedi: [C: 1] "note the per-API directory thing might be relevant to https://gerrit.wikimedia.org/r/#/c/150212 and related" (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/150506 (owner: Ori.livneh)
[10:11:11] (CR) Filippo Giunchedi: [C: -1] "I think we'd need to discuss this a bit also to find a solution to be put in debian (related perhaps to the per-api (or per-abi?) director" [operations/debs/hhvm] - https://gerrit.wikimedia.org/r/150212 (owner: Giuseppe Lavagetto)
[10:16:59] <_joe_> godog: about that, I concur, I tried to write up a simplicistic solution but we should do something else. Our main problem being, we have no way to determine if API/ABI has changed, really. see https://github.com/facebook/hhvm/pull/3322 (and the comment within, meh)
[10:21:31] _joe_: indeed, yeah we'll have to find a solution but if the number isn't accurate not much else comes to mind ATM
[10:21:58] (PS1) Giuseppe Lavagetto: wmflib: add ssl_ciphersuite [operations/puppet] - https://gerrit.wikimedia.org/r/150781
[10:23:05] <_joe_> godog: we add a substvar manually and bump it when we know abi changes
[10:23:08] <_joe_> :/
[10:23:44] <_joe_> i don't see any other solution, and hhvm guys do not want to commit to abi stability, and I do understand that, sort of
[10:25:04] <_joe_> so, we have two ways to do that IMO: 1) we patch hhvm so that --version gives us some useful string for marking an ABI version 2) we hardcode the value in debian/rules
[10:25:21] <_joe_> the 1) is my preferred solution ATM
[10:25:33] <_joe_> it's a simple patch we can maintain without hassle
[10:25:43] <_joe_> and it can be scripted easily
[10:26:34] <_joe_> also, I hope in the future we can work with releases and not with the absolute bleeding edge
[10:27:19] it is two different things though, it is fine to not have abi stability but would be nice if we had an easy way to know when that happens
[10:28:17] yeah releases would make it less frequent perhaps
[10:29:45] (PS2) Filippo Giunchedi: swift: monitor object/container availability [operations/puppet] - https://gerrit.wikimedia.org/r/149019
[10:29:51] (CR) Filippo Giunchedi: [C: 2 V: 2] swift: monitor object/container availability [operations/puppet] - https://gerrit.wikimedia.org/r/149019 (owner: Filippo Giunchedi)
[10:31:59] !log Jenkins: tweaking jobs labels, that might eventually screw up Zuul/Jenkins entirely.
[10:32:05] Logged the message, Master
[10:49:12] !log Jenkins: attempting to poll a Trusty slave (integration-slave1004-trusty [10.68.17.148] with label UbuntuTrusty).
[10:49:17] Logged the message, Master
[10:53:09] PROBLEM - Puppet freshness on db1006 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 08:52:54 UTC
[10:53:29] RECOVERY - Puppet freshness on db1006 is OK: puppet ran at Thu Jul 31 10:53:20 UTC 2014
[10:59:19] (CR) Alexandros Kosiaris: [C: -1] Mathoid configuration for beta labs (9 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/148836 (owner: Physikerwelt)
[11:25:12] (CR) Alexandros Kosiaris: [C: -1] "Liking the general idea for sure, comments inline" (4 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/150781 (owner: Giuseppe Lavagetto)
[11:30:53] (PS1) Hedonil: exec_environ.pp: Install libaio1 to enable asynchronous I/O system calls [operations/puppet] - https://gerrit.wikimedia.org/r/150787 (https://bugzilla.wikimedia.org/68615)
[11:37:09] PROBLEM - Puppet freshness on labsdb1004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 00:00:00 UTC
[11:37:37] !log Jenkins: upgrading almost all jobs to use a new label 'UbuntuPrecise' {{bug|68340}} {{gerrit|150785}}
[11:37:43] Logged the message, Master
[11:41:13] Reedy, will you be the one to
switch enwiki 14->15?
[12:04:31] !log reloading Jenkins configuration
[12:04:35] Logged the message, Master
[12:10:22] !log stopping Jenkins and restarting it
[12:10:28] Logged the message, Master
[12:14:20] snack lunch
[12:14:26] jenkins works
[12:37:03] (CR) coren: [C: 2] "+package" [operations/puppet] - https://gerrit.wikimedia.org/r/150787 (https://bugzilla.wikimedia.org/68615) (owner: Hedonil)
[12:37:12] (CR) coren: [V: 2] "+package" [operations/puppet] - https://gerrit.wikimedia.org/r/150787 (https://bugzilla.wikimedia.org/68615) (owner: Hedonil)
[12:38:04] * YuviPanda pokes Coren with https://gerrit.wikimedia.org/r/#/c/150425/
[12:38:52] Coren: also, we need the python-txstatsd package built and put into apt.wm.o for the graphite box. I can build the thing now, but don't think I've any rights to put things on apt.wm.o
[12:38:57] or even unsure what the process is?
[12:39:24] I'm not quite sure what the process is myself since I haven't had to do it yet. :-)
[12:39:41] Coren: heh
[12:39:54] Coren: I also take back what I said about packages being hard. at least for python... not so much
[12:40:27] (CR) coren: [C: 2] "If it breaks any toyes, it'll be those in your own pram.
:-)" [operations/puppet] - https://gerrit.wikimedia.org/r/150425 (owner: Yuvipanda)
[12:41:23] Coren: heh, yay :)
[12:42:11] (CR) Physikerwelt: Mathoid configuration for beta labs (9 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/148836 (owner: Physikerwelt)
[12:43:15] (PS15) Physikerwelt: Mathoid configuration for beta labs [operations/puppet] - https://gerrit.wikimedia.org/r/148836
[12:59:16] (CR) Giuseppe Lavagetto: wmflib: add ssl_ciphersuite (4 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/150781 (owner: Giuseppe Lavagetto)
[13:20:47] (CR) Alexandros Kosiaris: [C: 2] spamassassin: qualify vars [operations/puppet] - https://gerrit.wikimedia.org/r/149993 (owner: Matanya)
[13:22:03] (CR) Alexandros Kosiaris: [C: 2] swift:lint [operations/puppet] - https://gerrit.wikimedia.org/r/150505 (owner: Matanya)
[13:22:46] godog: merging this ^. Running through catalog-differ and it is ok
[13:23:05] akosiaris: awesome! tyvm sir
[13:23:32] thank you both godog and akosiaris
[13:25:15] matanya: thank you!
[13:29:27] <_joe_> matanya: the puppet masters are *really* grateful for all the work you've done on this
[13:30:06] thank you _joe_ :) this is much encouraging !
[13:30:23] <_joe_> btw whenever you have time, take a look at http://etherpad.wikimedia.org/p/Hiera and see if you have comments
[13:31:15] yay hiera
[13:34:16] (PS2) Giuseppe Lavagetto: wmflib: add ssl_ciphersuite [operations/puppet] - https://gerrit.wikimedia.org/r/150781
[13:34:24] (CR) Ottomata: "Weird! I was on the master branch, I must have some weird local branch crap going on. Will check it."
[operations/debs/kafka] (debian) - https://gerrit.wikimedia.org/r/150713 (owner: Ottomata)
[13:34:35] <_joe_> YuviPanda: you as well
[13:37:10] _joe_: yeah, leaving some comments about labs
[13:37:46] PROBLEM - Puppet freshness on labsdb1004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 00:00:00 UTC
[13:38:50] <_joe_> YuviPanda: thanks a lot, I think we should actually start with labs
[13:39:00] _joe_: +1
[13:39:07] (Abandoned) Ottomata: Fix for debian/README.Debian [operations/debs/kafka] (debian) - https://gerrit.wikimedia.org/r/150713 (owner: Ottomata)
[13:39:22] bahhh
[13:39:28] hhvm is under mediawiki module ... :/
[13:42:12] <_joe_> hashar: yes but we're building a standalone module
[13:42:27] <_joe_> that was a first, quick and dirty class for releasing the jobrunners
[13:43:52] (PS3) Giuseppe Lavagetto: mediawiki::web: get rid of envvars.appserver [operations/puppet] - https://gerrit.wikimedia.org/r/147514 (owner: Ori.livneh)
[13:44:24] !log reedy Synchronized php-1.24wmf15/extensions/WikimediaMessages: (no message) (duration: 00m 14s)
[13:44:30] Logged the message, Master
[13:44:48] !log reedy Synchronized php-1.24wmf15/extensions/RelatedSites/: (no message) (duration: 00m 15s)
[13:44:54] Logged the message, Master
[13:45:06] <_joe_> mmmh I will not release an apache change during swat
[13:45:18] <_joe_> I promise I'll behave
[13:45:24] _joe_: added a note about labs heira probably should be a different repo
[13:45:29] unsure what that entails, though
[13:46:15] <_joe_> YuviPanda: I do agree, I was thinking about doing something like keeping hiera data in operations/puppet for production, and having a separate one for labs maybe
[13:46:27] (PS1) Hashar: hhvm: create module + list all dev dependencies [operations/puppet] - https://gerrit.wikimedia.org/r/150813 (https://bugzilla.wikimedia.org/63120)
[13:46:42] _joe_: hmm, keeping them in operations/heira/production and operations/heira/labs was what I was thinking
[13:46:51] in the future, we might even have operations/heira/vagrant :)
[13:47:54] !log reedy Started scap: Rebuild 1.24wmf15 l10n cache for WikimediaMessages updates
[13:48:00] Logged the message, Master
[13:48:10] (PS2) Reedy: Move RelatedSites config to wgExtraInterlanguageLinkPrefixes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/148314 (https://bugzilla.wikimedia.org/41209) (owner: TTO)
[13:48:31] (CR) Reedy: [C: 1] "Dependencies merged, scap running. -1 removed" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/148314 (https://bugzilla.wikimedia.org/41209) (owner: TTO)
[13:51:39] (CR) Giuseppe Lavagetto: [C: -1] "Not sure if I got what you wanted to achieve here, but this doesn't probably work." (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/150813 (https://bugzilla.wikimedia.org/63120) (owner: Hashar)
[13:51:39] Error: Failed to apply catalog: Could not find dependent Package[apache2-mpm-worker] for Apache::Mod_conf[mpm_worker] at /etc/puppet/modules/apache/manifests/mpm.pp:46
[13:51:43] puppet is never ending
[13:51:53] <_joe_> wut>
[13:52:09] <_joe_> hashar: where is that?
[13:53:09] <_joe_> YuviPanda: yes that may work
[13:53:32] (CR) Hashar: hhvm: create module + list all dev dependencies (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/150813 (https://bugzilla.wikimedia.org/63120) (owner: Hashar)
[13:53:33] <_joe_> Reedy: please let me know when it's ok for me to fiddle with apache
[13:53:47] _joe_: on a Trusty instance integration-slave1004-trusty.eqiad.wmflabs
[13:54:11] it uses some contint:: classes
[13:55:00] <_joe_> hashar: http://packages.ubuntu.com/search?keywords=apache2-mpm-worker&searchon=names&suite=trusty&section=all
[13:55:17] <_joe_> oh wait
[13:55:23] <_joe_> that's saying something different
[13:55:26] it is in puppet
[13:55:37] Package[apache2-mpm-worker] isn't defined
[13:55:41] hurra
[13:55:56] <_joe_> correctly not
[13:56:23] <_joe_> ok seen the problem
[13:56:44] <_joe_> hashar: fixing
[13:56:48] <_joe_> that's my fault btw
[13:56:48] the labs project (integration) has a local puppetmaster so we can test it out :]
[13:57:54] (CR) Nemo bis: [C: 1] "Thanks" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/148314 (https://bugzilla.wikimedia.org/41209) (owner: TTO)
[14:03:36] (PS1) Giuseppe Lavagetto: apache: remove $rejected_pkgs [operations/puppet] - https://gerrit.wikimedia.org/r/150816
[14:03:45] <_joe_> hashar: ^^
[14:03:52] ;-D
[14:03:56] cherry picking
[14:04:06] <_joe_> take a look and see if it makes sense
[14:05:22] on Precise I now have: error ArgumentError: Invalid resource type monitor_group at /etc/puppet/manifests/facilities.pp:110 on node i-000001bd.eqiad.wmflabs
[14:05:38] probably unrelated
[14:05:55] !log removed labs-in4 and labs-in6 filters on vlan 1117 (labs-hosts1-a-eqiad) on cr[12]-eqiad
[14:06:01] Logged the message, Master
[14:06:49] yeah transient
[14:07:38] (CR) Hashar: "I have cherry picked the patch on 'integration' puppetmaster. Precise instance is happy."
[operations/puppet] - https://gerrit.wikimedia.org/r/150816 (owner: Giuseppe Lavagetto)
[14:09:32] _joe_: thanks :°
[14:09:47] <_joe_> hashar: eh, that was my error in the first place
[14:10:05] <_joe_> it's part of the optimization-OCD ori and I share
[14:10:20] <_joe_> I "perfected" his patch
[14:10:28] <_joe_> and screwed up
[14:10:34] <_joe_> well, only on trusty
[14:10:34] !log reedy Finished scap: Rebuild 1.24wmf15 l10n cache for WikimediaMessages updates (duration: 22m 40s)
[14:10:40] <_joe_> :P
[14:10:40] Logged the message, Master
[14:10:47] (CR) Reedy: [C: 2] Move RelatedSites config to wgExtraInterlanguageLinkPrefixes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/148314 (https://bugzilla.wikimedia.org/41209) (owner: TTO)
[14:10:54] (Merged) jenkins-bot: Move RelatedSites config to wgExtraInterlanguageLinkPrefixes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/148314 (https://bugzilla.wikimedia.org/41209) (owner: TTO)
[14:11:03] (PS2) Hashar: hhvm: create module + list all dev dependencies [operations/puppet] - https://gerrit.wikimedia.org/r/150813 (https://bugzilla.wikimedia.org/63120)
[14:11:06] (PS1) Reedy: Revert "Wikivoyages back to 1.24wmf14" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/150818
[14:11:09] (PS2) Reedy: Revert "Wikivoyages back to 1.24wmf14" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/150818
[14:11:14] (CR) Reedy: [C: 2] Revert "Wikivoyages back to 1.24wmf14" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/150818 (owner: Reedy)
[14:11:19] (Merged) jenkins-bot: Revert "Wikivoyages back to 1.24wmf14" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/150818 (owner: Reedy)
[14:11:37] akosiaris: i think my kafka change is wonky
[14:11:44] (CR) Hashar: "PS2 fix a typo in contint server: hvvm -> hhvm" [operations/puppet] - https://gerrit.wikimedia.org/r/150813
(https://bugzilla.wikimedia.org/63120) (owner: Hashar)
[14:11:45] it looks like it is trying to go on the debian branch, when it should be on master
[14:12:04] !log reedy Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 14s)
[14:12:08] i'm trying to figure out how to switch it to debian branch, but I might have to submit a new gerrit change and abandon that one...
[14:12:09] Logged the message, Master
[14:12:25] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikivoyages to 1.24wmf15
[14:12:30] Logged the message, Master
[14:12:37] ottomata: which one?
[14:12:49] debian branch is dead btw
[14:13:18] (Abandoned) Giuseppe Lavagetto: hhvm: provide hhvm-build-$VERSION [operations/debs/hhvm] - https://gerrit.wikimedia.org/r/150212 (owner: Giuseppe Lavagetto)
[14:13:25] yeah, but it is still in gerrit? not sure
[14:13:28] this one
[14:13:29] https://gerrit.wikimedia.org/r/#/c/149889/
[14:13:34] Branch
[14:13:34] https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/debs/kafka+branch:debian,n,z
[14:13:34] Topic
[14:13:34] master
[14:13:57] (PS1) Alexandros Kosiaris: Split kafka package into 3 separate packages [operations/debs/kafka] - https://gerrit.wikimedia.org/r/150819
[14:14:12] cherry-picked to master
[14:14:38] (PS2) Ottomata: Split kafka package into 3 separate packages [operations/debs/kafka] - https://gerrit.wikimedia.org/r/150819 (owner: Alexandros Kosiaris)
[14:14:45] oop
[14:15:31] should be mergeable now
[14:15:48] ok so, new change then, ok
[14:15:51] _joe_: Should be fine to do apache stuffs now
[14:16:01] the old one should be mergeable too
[14:16:12] gerrit did not let me delete the debian branch
[14:16:18] i think I just did the same you did, but via the CLI :p
[14:16:21] hm
[14:16:22] ok
[14:16:24] merging, thanks
[14:16:25] as long as there were unmerged changes
[14:16:42] (CR) Ottomata: [C: 2 V: 2] "This change was reviewed at
https://gerrit.wikimedia.org/r/#/c/149889/" [operations/debs/kafka] - https://gerrit.wikimedia.org/r/150819 (owner: Alexandros Kosiaris)
[14:16:51] aye
[14:16:56] abandoning the other
[14:17:11] (CR) Hashar: "Cherry picked on integration puppet master. But that needs a bit of work as Giuseppe suggested." [operations/puppet] - https://gerrit.wikimedia.org/r/150813 (https://bugzilla.wikimedia.org/63120) (owner: Hashar)
[14:17:17] (Abandoned) Ottomata: Split kafka package into 3 separate packages [operations/debs/kafka] (debian) - https://gerrit.wikimedia.org/r/149889 (owner: Ottomata)
[14:17:20] <_joe_> Reedy: thanks :)
[14:17:33] let's see if I can delete the gerrit debian branch then..
[14:17:38] ottomata: btw how do you feel about this one https://gerrit.wikimedia.org/r/#/c/147499/ ?
[14:17:46] ottomata: still 2 open
[14:18:00] the one I mentioned and https://gerrit.wikimedia.org/r/#/c/148287/
[14:18:06] which I suppose we can abandon ?
[14:18:07] yeah 14287
[14:18:20] we can abandon that I think, I think I fixed those issues in my change
[14:18:31] cool abandoning it then
[14:18:55] akosiaris: i've never used pbuilder, let's merge this and I will try to build the packages again using it and your readme instructions
[14:19:04] (Abandoned) Alexandros Kosiaris: Fix debian/bin/kafka [operations/debs/kafka] (debian) - https://gerrit.wikimedia.org/r/148287 (owner: Plucas)
[14:25:29] (CR) Ottomata: [C: 2] "Let's do it!" [operations/debs/kafka] (debian) - https://gerrit.wikimedia.org/r/147499 (owner: Alexandros Kosiaris)
[14:25:40] akosiaris: cherry pick that one too then?
[14:25:58] (btw, i probably won't see these changes unless you add me as reviewer)
[14:26:39] ottomata: I tried... it has merge conflicts...
resolving them now
[14:27:13] (CR) Giuseppe Lavagetto: hhvm: lintian fixes (1 comment) [operations/debs/hhvm] - https://gerrit.wikimedia.org/r/150213 (owner: Giuseppe Lavagetto)
[14:27:54] (PS1) Alexandros Kosiaris: Use pbuilder by default [operations/debs/kafka] - https://gerrit.wikimedia.org/r/150825
[14:28:02] (PS1) Giuseppe Lavagetto: hhvm: lintian fixes [operations/debs/hhvm] - https://gerrit.wikimedia.org/r/150826
[14:28:04] heh... new change :-(
[14:28:25] (Abandoned) Alexandros Kosiaris: Use pbuilder by default [operations/debs/kafka] (debian) - https://gerrit.wikimedia.org/r/147499 (owner: Alexandros Kosiaris)
[14:28:37] (CR) Giuseppe Lavagetto: [C: 2 V: 2] hhvm: lintian fixes [operations/debs/hhvm] - https://gerrit.wikimedia.org/r/150826 (owner: Giuseppe Lavagetto)
[14:28:51] (CR) Alexandros Kosiaris: [C: 2 V: 2] "Already reviewed in https://gerrit.wikimedia.org/r/147499" [operations/debs/kafka] - https://gerrit.wikimedia.org/r/150825 (owner: Alexandros Kosiaris)
[14:29:25] <_joe_> I have a gerrit question
[14:29:38] ah... everyone's favourite toy
[14:29:41] <_joe_> oh nevermind
[14:29:47] ahahahaha
[14:29:48] <_joe_> it seems I got it right
[14:29:50] A first build: https://integration.wikimedia.org/ci/job/php-FastStringSearch-hhvm-build/1/console
[14:29:50] 00:00:05.874 cc1plus: error: /root/hhvm/joe/hhvm: Permission denied
[14:29:50] :-D
[14:29:58] <_joe_> I don't know why
[14:29:59] poor FastStringSearch
[14:30:08] <_joe_> hashar: that's fucking hphpize
[14:30:16] anyway gotta move.
But we have a Jenkins Trusty slave pooled
[14:30:19] <_joe_> retaining somewhere the path it's been built into
[14:30:32] <_joe_> next time I'll build in /tmp
[14:30:33] which runs some experimental compilation of wikidiff2 luasandbox and FastStringSearch
[14:30:36] details on https://bugzilla.wikimedia.org/show_bug.cgi?id=63120
[14:30:38] yeah
[14:30:49] I am off for now
[14:30:57] <_joe_> hashar: filippo discovered that, it sucks
[14:31:10] bug fill it ! :-D
[14:31:26] I will continue tomorrow
[14:31:55] ottomata: done, thanks!!!
[14:32:38] (PS1) Reedy: Allow sysops and 'crats on wikimania2014wiki to grant confirmed [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/150829
[14:33:12] (CR) Reedy: [C: 2] Allow sysops and 'crats on wikimania2014wiki to grant confirmed [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/150829 (owner: Reedy)
[14:33:17] (Merged) jenkins-bot: Allow sysops and 'crats on wikimania2014wiki to grant confirmed [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/150829 (owner: Reedy)
[14:33:56] !log reedy Synchronized wmf-config/InitialiseSettings.php: Allow sysops and 'crats on wikimania2014wiki to grant confirmed (duration: 00m 15s)
[14:34:01] Logged the message, Master
[14:38:52] (PS1) BBlack: add labs-hosts1-a-eqiad to dhcpd.conf [operations/puppet] - https://gerrit.wikimedia.org/r/150832
[14:39:36] (CR) BBlack: [C: 2 V: 2] add labs-hosts1-a-eqiad to dhcpd.conf [operations/puppet] - https://gerrit.wikimedia.org/r/150832 (owner: BBlack)
[14:57:19] !log added labstore1003 to filter labs-in4 terms allow-labstore-(udp|tcp)4 on cr[12]-eqiad
[14:57:24] Logged the message, Master
[15:00:05] manybubbles, anomie, Reedy: Respected human, time to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140731T1500). Please do the needful.
[15:00:25] * anomie liked "the time is nigh" better
[15:00:32] * anomie also observes no patches for SWAT
[15:08:45] anomie: YuviPanda got all fancy with the messages and made them random -- https://github.com/wikimedia/wikimedia-bots-jouncebot/blob/master/DefaultConfig.yaml
[15:09:18] (Restored) Giuseppe Lavagetto: hhvm: provide hhvm-build-$VERSION [operations/debs/hhvm] - https://gerrit.wikimedia.org/r/150212 (owner: Giuseppe Lavagetto)
[15:09:26] bd808: oh, did it say something now?
[15:09:29] * YuviPanda lost scrollback
[15:10:12] YuviPanda: It was "Respected human" this time but I saw "Dear anthropoid" yesterday
[15:10:24] bd808: heh :)
[15:10:41] sidenote - this makes my ping for the message harder :(
[15:11:14] My client apparently won't let me set a watch on everything said by a given user
[15:11:19] hmm, bah
[15:11:44] * YuviPanda is unsure what to do / fix
[15:11:55] meh not a big deal
[15:12:49] bd808: I could add a 'notify users' feature, that'll PM you everytime there's a deployment
[15:13:08] ew no thanks
[15:13:15] bd808: :D
[15:13:38] * YuviPanda instead makes icinga-wm PM bd808 every time something's critical
[15:14:23] the new jouncebot message seems very discrimantory towards humans undeserving of respect :P
[15:14:35] I actually liked it best when it was using /notice
[15:14:55] I didn't like /notice, since that didn't actually ping me
[15:15:15] bblack: "Lowly worker bee, now is the time to get to work"
[15:15:31] * anomie has some sort of crossed meaning in his head between "anthropoid" and "arthropod"
[15:15:34] bblack: it'll descriminate against anthropoids who don't want to be called 'dear' 50% of the time
[15:15:56] anomie: Ah. My client flashes a desktop announcement for all /notices in all channels
[15:29:34] LFaraone: Bah, back now; sorry you had issues overnight. :-(
[15:29:51] LFaraone: (DB seems back now.)
[15:33:49] (03PS9) 10Ori.livneh: Add HHVM module [operations/puppet] - 10https://gerrit.wikimedia.org/r/150506 [15:34:11] (03CR) 10Ori.livneh: [C: 032 V: 032] Add HHVM module [operations/puppet] - 10https://gerrit.wikimedia.org/r/150506 (owner: 10Ori.livneh) [15:36:55] (03PS1) 10Alexandros Kosiaris: Setup labsdb1006, labsdb1007 as osmdbs [operations/puppet] - 10https://gerrit.wikimedia.org/r/150837 [15:37:45] (03PS1) 10Ori.livneh: hhvm: set hhvm.log.header = true [operations/puppet] - 10https://gerrit.wikimedia.org/r/150838 [15:38:47] PROBLEM - Puppet freshness on labsdb1004 is CRITICAL: Last successful Puppet run was Thu 01 Jan 1970 00:00:00 UTC [15:47:11] (03PS1) 10Alexandros Kosiaris: Bump up postgresql max_connections to 120 [operations/puppet] - 10https://gerrit.wikimedia.org/r/150841 [15:48:11] (03CR) 10Alexandros Kosiaris: [C: 032] Bump up postgresql max_connections to 120 [operations/puppet] - 10https://gerrit.wikimedia.org/r/150841 (owner: 10Alexandros Kosiaris) [15:53:22] (03CR) 10Alexandros Kosiaris: [C: 032] Setup labsdb1006, labsdb1007 as osmdbs [operations/puppet] - 10https://gerrit.wikimedia.org/r/150837 (owner: 10Alexandros Kosiaris) [15:57:56] RECOVERY - Puppet freshness on labsdb1004 is OK: puppet ran at Thu Jul 31 15:57:48 UTC 2014 [15:58:06] PROBLEM - puppet last run on labsdb1004 is CRITICAL: CRITICAL: Puppet last ran 719156 seconds ago, expected 14400 [15:59:06] RECOVERY - puppet last run on labsdb1004 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [16:02:27] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [16:02:36] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet). [16:02:58] shame on me [16:03:27] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[16:03:36] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [16:03:46] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CRITICAL: Puppet has 1 failures [16:03:53] * _joe_ throws rocks to akosiaris [16:04:06] I was sure I had merged btw [16:06:37] (03PS1) 10Giuseppe Lavagetto: hhvm: provide hhvm-api-$VERSION [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/150845 [16:06:46] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CRITICAL: Puppet has 1 failures [16:07:09] (03Abandoned) 10Giuseppe Lavagetto: hhvm: provide hhvm-build-$VERSION [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/150212 (owner: 10Giuseppe Lavagetto) [16:08:46] (03PS2) 10Giuseppe Lavagetto: hhvm: provide hhvm-api-$VERSION [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/150845 [16:09:37] (03CR) 10Ori.livneh: [C: 031] mediawiki: get rid of envvars files in puppet. [operations/puppet] - 10https://gerrit.wikimedia.org/r/150492 (owner: 10Giuseppe Lavagetto) [16:10:20] chasemp: I'm adding a new group to admins. Does the gid derive from ldap such that I should create a group in ldap first? [16:14:11] Not working today but saw this :). they are meant to match but so far it hasn't been worked out. So anything 700 unused should work [16:14:23] ok [16:14:43] I guess the next couple of days are going to be lonesome, what with everyone on their way to Paris :) [16:14:43] andrewbogott: I reply to the RT ticket regarding me btw :) [16:14:51] JohnLewis: great! thank you. [16:15:13] *replied for meta correction-ness :p [16:15:41] JohnLewis: I'm not sure what our RT policies are, I will check in with mutante if/when he comes to work :) [16:16:24] andrewbogott: alright - it was just a 'if I'm here doing this, RT may be useful but I'll let you guys decide whether its worth the effort your side' :) [16:17:00] andrewbogott: paris? 
[16:17:40] Oh, um, London [16:17:46] I'm confusing wikimania w/OpenStack [16:17:51] (neither of which I"m going to) [16:18:24] hm, I guess I missed the chance to convince YuviPanda that he's in the wrong country :( [16:18:32] heh [16:22:34] (03PS1) 10coren: labstore1003: Give up on Trusty for the time being [operations/puppet] - 10https://gerrit.wikimedia.org/r/150849 [16:22:44] andrewbogott: Can you +2 ^^ plz? [16:22:57] (03PS1) 10Andrew Bogott: New admin group for eventlogging troubleshooting. [operations/puppet] - 10https://gerrit.wikimedia.org/r/150850 [16:23:15] (03CR) 10Andrew Bogott: [C: 032] ":(" [operations/puppet] - 10https://gerrit.wikimedia.org/r/150849 (owner: 10coren) [16:25:22] (03PS1) 10Filippo Giunchedi: add ssh-based uploads to releases.wikimedia.org [operations/puppet] - 10https://gerrit.wikimedia.org/r/150851 [16:26:53] andrewbogott: Did you already merge on carbon as well or not? [16:27:04] just on palladium [16:28:09] andrewbogott: I think I'll be able to sneak one boot in before carbon grabs it; wanna revert in the meantime? [16:30:53] (03PS1) 10Andrew Bogott: Revert "labstore1003: Give up on Trusty for the time being" [operations/puppet] - 10https://gerrit.wikimedia.org/r/150853 [16:30:59] :( [16:31:16] ebernhardson: why :( [16:31:22] :( [16:31:25] (03CR) 10Andrew Bogott: [C: 032] Revert "labstore1003: Give up on Trusty for the time being" [operations/puppet] - 10https://gerrit.wikimedia.org/r/150853 (owner: 10Andrew Bogott) [16:31:44] YuviPanda: i read wrong :P I thought that said give up on trusty, instead of revert give up on trusty [16:31:49] ebernhardson: heh :) [16:32:20] (03CR) 10Filippo Giunchedi: [C: 031] mediawiki::web: get rid of envvars.appserver [operations/puppet] - 10https://gerrit.wikimedia.org/r/147514 (owner: 10Ori.livneh) [16:32:31] ori: I believe we're good ^ [16:32:53] with the code review that is, also as a general statement too [16:33:11] godog: should i merge it? [16:34:09] _joe_: what do you think? 
re: https://gerrit.wikimedia.org/r/147514 [16:35:04] ori: I would but I'm not overly familiar with testing/rolling out changes to the appservers yet [16:35:15] <_joe_> godog: I can show you [16:35:28] <_joe_> or explain, which is even simpler [16:35:29] <_joe_> :) [16:35:35] <_joe_> in 5 minutes [16:35:44] sure [16:41:03] _joe_: were you ever able to determine if HHVM handles logrotation gracefully? it appears to use SIGHUP for graceful-stop, i don't see anything special for SIGUSR1/2.. [16:41:36] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [16:41:49] <_joe_> ori: HHVM doesn't give a shit about signals, it just dies [16:42:06] <_joe_> I think I opened an issue [16:42:10] _joe_: there's https://github.com/facebook/hhvm/commit/91e8609 [16:42:24] maybe it doesn't work, but at least it claims to graceful on sighup [16:42:36] it appears to work for me but it's hard to test [16:42:49] sorry, i'll let you focus on the convo with godog [16:45:27] matanya: regarding https://gerrit.wikimedia.org/r/#/c/150521/ -- the ticket requests access to ocg-render-admin, which does not exist; your attached patch adds them to pdf-render-admin [16:45:29] can you explain? [16:46:20] <_joe_> ori: yes gimme a minute [16:46:49] <_joe_> you're allowing me to procrastinate some very important but tedious accounting work [16:46:59] <_joe_> the joys of being a contractor... [16:46:59] heh [16:47:52] mark: what's a reasonable time to wait on an app server to graceful-stop before sending SIGKILL? upstart's default is 5 seconds [16:48:17] i'd say a bit more, but I'd have to test [16:48:41] (03CR) 10Filippo Giunchedi: [C: 04-1] hhvm: lintian fixes (033 comments) [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/150213 (owner: 10Giuseppe Lavagetto) [16:49:04] hmm. maybe 10 to start, and we adjust from there? 
[16:50:24] i'll test on osmium [16:51:18] ori: I'll probably merge/deploy first thing tomorrow, too late/tired now [16:51:33] (03CR) 10Mark Bergsma: [C: 031] seting the scs-[a|c]1-codfw.mgmt.codfw.wmnet entries [operations/dns] - 10https://gerrit.wikimedia.org/r/150629 (owner: 10RobH) [16:51:38] godog: +1. thanks! [16:51:49] godog: and thanks for the other reviews too! [16:52:29] <_joe_> ori: I think the value of Timeout we have in our apache config is a sensible time to wait :) [16:52:54] 200? [16:53:05] <_joe_> we have 200? ouch [16:53:10] <_joe_> that's incredibly long [16:53:32] the default is 300 [16:53:36] but yeah [16:53:54] <_joe_> yes default is for the webserver I run on my mediacenter :P [16:54:02] graceful stop is a bit more urgent tho [16:54:09] <_joe_> I usually set it to 20 or less [16:54:12] <_joe_> ori: yes [16:54:20] <_joe_> so set it to 10 [16:54:21] if you just want to leisurely decom a server then we just depool it and wait for the connections to drain [16:54:26] (03CR) 10Manybubbles: "I'm a bit concerned about the extra IO required to pump through 100MB/sec. Right now requiring the cluster to recover causes some nasty I" [operations/puppet] - 10https://gerrit.wikimedia.org/r/150586 (owner: 10Chad) [16:54:37] * _joe_ nods [16:54:50] <_joe_> ori: I just thought we had a much shorter timeout [16:54:57] i'm going to leave it unset for now (which will make it default to 5) [16:55:10] if i set it at 10 it will become "magical", i.e. 
people may mistake that for a scientifically determined value [16:55:14] so better to leave it unset for now [16:55:20] (03CR) 10RobH: [C: 032] seting the scs-[a|c]1-codfw.mgmt.codfw.wmnet entries [operations/dns] - 10https://gerrit.wikimedia.org/r/150629 (owner: 10RobH) [16:55:35] <_joe_> eheh [16:55:56] <_joe_> or, you comment your choice declaring it's a guesswork [16:56:43] (03PS2) 10Ori.livneh: Tweaks to HHVM module [operations/puppet] - 10https://gerrit.wikimedia.org/r/150838 [16:56:59] <_joe_> so, now I should really get to do my accounting work sorry [16:57:12] kk. i'll merge that one, it's trivial [16:57:19] unless you object [16:57:23] <_joe_> ori: I may be around later, but I thought I can do a bunch of apache merges tomorrow [16:57:33] _joe_: no worries, take care of your accounting stuff! [16:57:38] thanks for the reviews [16:57:55] <_joe_> +2 for log.header True [16:58:16] oh there is one thing that's possibly controversial [16:58:16] (03CR) 10Giuseppe Lavagetto: [C: 031] Tweaks to HHVM module [operations/puppet] - 10https://gerrit.wikimedia.org/r/150838 (owner: 10Ori.livneh) [16:58:20] respawn limit unlimited [16:58:28] how do you feel about that? [16:58:41] <_joe_> that it does not make sense in general [16:58:57] <_joe_> we should give up eventually if hhvm doesn't respawn [16:59:10] <_joe_> but we can live with that [16:59:28] well, these boxes don't havemuch to do without hhvm anyway:P [16:59:30] <_joe_> also, our experience is showing hhvm fails in less obvious ways than just crashing [17:00:11] ok, let's think about this more. 
i'll amend the patch to drop that directive [17:00:25] (03PS1) 10Reedy: Add symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150858 [17:00:27] (03PS1) 10Reedy: testwiki to 1.24wmf16 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150859 [17:00:29] (03PS1) 10Reedy: Wikipedias to 1.24wmf15 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150860 [17:00:31] (03PS1) 10Reedy: group0 to 1.24wmf16 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150861 [17:00:48] (03CR) 10Reedy: [C: 032] Add symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150858 (owner: 10Reedy) [17:00:51] (03Merged) 10jenkins-bot: Add symlinks [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150858 (owner: 10Reedy) [17:00:51] <_joe_> MaxSem: that's not the reason. If there is some external reason making hhvm fail (like say, a db refusing connections due to a network partition) restarting it furiously can only make things worse [17:01:20] surely a db refusing connections won't make hhvm server die? [17:01:20] (03PS3) 10Ori.livneh: Tweaks to HHVM module [operations/puppet] - 10https://gerrit.wikimedia.org/r/150838 [17:01:23] <_joe_> that's my only doubt [17:01:35] (I think I'm in favor of unlimited respawn) [17:02:13] we can add a brief sleep to the post-stop stanza [17:02:19] Hi ops! Quick question: does anything about caching or production setup make $_SESSION (and consequently WebRequest::getSessionData()/setSessionData()) unreliable? I have a bug in code that uses it, and the bug only shows up on production (on Meta), works fine on my local install [17:02:38] <_joe_> godog: I may recall wrong, but you usually tell hhvm not to respawn if more than N respawn attempts have happpened in M seconds [17:02:59] <_joe_> which is a way to avoid spawn-crash-respawn loops [17:03:02] AndyRussG: what exactly is in $_SESSION ? A cookie only for logged-in users? 
[17:03:04] (03PS3) 10Reedy: Add export-0.9.xsd [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/149643 (https://bugzilla.wikimedia.org/68686) [17:03:10] AndyRussG, we use session for keeping 100,000s of users logged in [17:03:12] <^demon|away> AndyRussG: No, if you're hitting PHP you're already past varnish. [17:03:15] <_joe_> in any other situation, it will respawn like in the unlimited case [17:03:17] (03CR) 10Reedy: [C: 032] Add export-0.9.xsd [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/149643 (https://bugzilla.wikimedia.org/68686) (owner: 10Reedy) [17:03:21] _joe_: but what's the benefit of staying down? [17:03:21] (03Merged) 10jenkins-bot: Add export-0.9.xsd [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/149643 (https://bugzilla.wikimedia.org/68686) (owner: 10Reedy) [17:03:41] as MaxSem said these servers don't have much else to do [17:03:41] if there was a general problem, we would've already had complaints [17:03:43] <^demon|away> bblack: $_SESSION is the PHP superglobal that contains data you've saved to the user's session. [17:03:49] ah [17:03:57] but we only have sessions for logged-in users, right? [17:04:07] AndyRussG, what kind of unreliability? [17:04:18] PROBLEM - Disk space on labsdb1005 is CRITICAL: DISK CRITICAL - free space: / 33893 MB (3% inode=99%): [17:04:22] (03CR) 10Cscott: "From #mediawiki-pdfhack:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/150521 (owner: 10Matanya) [17:04:31] <^demon|away> bblack: Yeah, as far as PHP is concerned :) [17:04:37] bblack: MaxSem: ^demon|away: Ah hmm... The bug shows up for logged-in and non-logged in users. 
https://bugzilla.wikimedia.org/show_bug.cgi?id=39212 [17:04:46] And works fine locally for both [17:05:09] <_joe_> ori: it doesn't make that much of a difference anyway [17:05:21] It's just storing a filter value for a paged list [17:05:25] <_joe_> respawn unlimited is good for now, we may decide to change it later [17:05:27] (03PS1) 10Ori.livneh: HHVM: set unlimited respawn limit in Upstart config [operations/puppet] - 10https://gerrit.wikimedia.org/r/150862 [17:05:32] AndyRussG, you're using sessions for anons? [17:05:34] <^demon|away> AndyRussG: You get lucky locally. Nothing in $_SESSION will work for anonymous users. [17:05:39] ok, so merging both [17:05:54] (03CR) 10Ori.livneh: [C: 032] Tweaks to HHVM module [operations/puppet] - 10https://gerrit.wikimedia.org/r/150838 (owner: 10Ori.livneh) [17:05:55] <_joe_> (btw, isn't respawn unlimited the default? it used to be) [17:06:09] nope, respawn limit 10 5 [17:06:14] 10 times in a 5 second period [17:06:19] MaxSem: That seems to be what this code is trying to do [17:06:34] ^demon|away: also doesn't work for logged-ins [17:06:35] <_joe_> well, I hope we never overflow that anyways [17:07:09] <_joe_> ori: btw, will we ever be able to run mediawiki in RepoAuthoritative mode? [17:07:18] yes [17:07:21] it'd be silly not too [17:07:24] <_joe_> oh, wow [17:07:25] https://meta.wikimedia.org/wiki/Special:CentralNoticeBanners [17:07:31] (03PS2) 10Reedy: Remove deprecated $wgHTCPMulticastAddress and replace with $wgHTCPRouting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150759 (owner: 10Withoutaname) [17:07:35] (03CR) 10Reedy: [C: 032] Remove deprecated $wgHTCPMulticastAddress and replace with $wgHTCPRouting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150759 (owner: 10Withoutaname) [17:07:37] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
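Note on the respawn discussion above: the "respawn limit 10 5" default quoted by ori (10 times in a 5 second period) can be pictured as a sliding-window counter. The following is only a minimal sketch of that idea, not Upstart's actual bookkeeping; the class name `RespawnLimiter` and its methods are hypothetical and exist purely for illustration.

```python
from collections import deque


class RespawnLimiter:
    """Hypothetical helper: allow at most `count` respawns per `interval` seconds."""

    def __init__(self, count=10, interval=5.0):
        self.count = count
        self.interval = interval
        self._stamps = deque()  # timestamps of respawns still inside the window

    def allow(self, now):
        # Forget respawns that have fallen out of the time window.
        while self._stamps and now - self._stamps[0] >= self.interval:
            self._stamps.popleft()
        # Too many recent respawns: give up, as _joe_ describes for crash loops.
        if len(self._stamps) >= self.count:
            return False
        self._stamps.append(now)
        return True


limiter = RespawnLimiter(count=10, interval=5.0)
results = [limiter.allow(i * 0.1) for i in range(11)]  # 11 crashes within ~1 second
```

Under these assumptions the first ten respawns are allowed and the eleventh is refused, which is the spawn-crash-respawn protection _joe_ refers to. An unlimited respawn policy simply skips this check.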
[17:07:38] (03Merged) 10jenkins-bot: Remove deprecated $wgHTCPMulticastAddress and replace with $wgHTCPRouting [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150759 (owner: 10Withoutaname) [17:07:38] but it means we'd have to change our approach to debugging /deployment a little [17:07:44] and that's going to be a bit tricky [17:08:02] <_joe_> well, deployments will change completely probably :) [17:08:02] so we decided to defer that to some time until after initial deployment [17:08:10] <_joe_> of course [17:08:17] it's nice to have that up our sleeve tho [17:08:29] <_joe_> I also thought our code didn't play well with repoauth mode [17:08:45] * ^demon|away will miss the ability to just throw a var_dump() in a file and sync it [17:08:46] MaxSem: ^demon|away: bblack: log in there, set a reasonable filter (say, "B14") and go to the next page. Filter bork. [17:08:55] (03PS1) 10Andrew Bogott: Rename pdf-render-admins to ocg-render-admins [operations/puppet] - 10https://gerrit.wikimedia.org/r/150863 [17:09:03] (03PS2) 10Reedy: Fixed "Undefined index: HTTP_X_FORWARDED_FOR" warning [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150707 (owner: 10Aaron Schulz) [17:09:11] (03CR) 10Reedy: [C: 032] Fixed "Undefined index: HTTP_X_FORWARDED_FOR" warning [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150707 (owner: 10Aaron Schulz) [17:09:11] Links to relevant lines in code in the bug [17:09:15] (03Merged) 10jenkins-bot: Fixed "Undefined index: HTTP_X_FORWARDED_FOR" warning [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150707 (owner: 10Aaron Schulz) [17:09:27] <_joe_> like, don't we have all those things like mixed returns, $$n and all the other dirty language hacks I loved when I coded PHP? 
[17:09:32] (03CR) 10Ori.livneh: [C: 032] "<_joe_> respawn unlimited is good for now, we may decide to change it later / (I think I'm in favor of unlimited respawn)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/150862 (owner: 10Ori.livneh) [17:09:46] <_joe_> eheh CR over IRC [17:09:46] <^demon|away> _joe_: $$n works in HHVM, not hack. [17:10:02] _joe_: we don't have the crazy dynamic things that don't work in repo mode [17:10:08] <_joe_> ^demon|away: I understood it didn't in RepoAuthoritative mode [17:10:16] (03CR) 10Mwalker: [C: 031] "Cool; not a big change at all" [operations/puppet] - 10https://gerrit.wikimedia.org/r/150863 (owner: 10Andrew Bogott) [17:10:22] (03CR) 10Cscott: [C: 031] Rename pdf-render-admins to ocg-render-admins [operations/puppet] - 10https://gerrit.wikimedia.org/r/150863 (owner: 10Andrew Bogott) [17:10:24] _joe_: no, $$n was one of the examples he gave of dynamic language features that *do* work [17:10:38] eval() doesn't iirc [17:10:46] <_joe_> oh ok, works but it's horribly slow right? [17:10:57] <_joe_> eval cannot work in that mode I'd say [17:11:12] MaxSem: ^demon|away: bblack: or maybe it has to do with where $_SESSION is set? It's set from a __destruct() method [17:11:30] <^demon|away> That's a very bad place to try and use the sssion. [17:11:38] <^demon|away> It's likely destroyed (or soon to be destroyed) at that point. 
[17:12:23] AndyRussG, mobie experimented with sessions for anons on special pages only (they're never cached by varnish), yet still we found it brutally unreliable [17:12:35] I can't imagine how you can make it right [17:13:06] "It's set from a __destruct() method" sounds like a bad idea no matter what we're talking about :) [17:13:21] (03CR) 10Reedy: [C: 04-1] "Guess this has to wait till we deploy wmf17 to group0 and wmf16 to wikipedias (which is 2 weeks from now..)" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/149787 (https://bugzilla.wikimedia.org/55678) (owner: 10Legoktm) [17:13:27] PHP is a bad idea in general;) [17:13:49] Reedy: why? [17:14:25] legoktm: Why what? [17:14:33] why do we have to wait? [17:14:53] because the messages won't be there? [17:14:57] ohhhhh [17:15:00] blaah [17:15:07] Oh [17:15:09] chasemp (who is still not working): What if I remove a /group/ from data.yaml? [17:15:12] they exist in the extension? [17:15:17] meh [17:15:21] but we no longer require_once the extension [17:15:39] we could just hack and do the $wgMessageDirs in config [17:15:43] (03CR) 10Chad: "This is one of the ways to help deal with the runaway-disk problems methinks. Because the reallocation algorithm calculates based on sizes" [operations/puppet] - 10https://gerrit.wikimedia.org/r/150586 (owner: 10Chad) [17:15:59] Reedy: l10nupdate or whatever won't backfill the messages? [17:16:11] not if they're not in the branch... [17:16:13] we could backport [17:16:14] :D [17:16:25] <^demon|away> bblack: Considering the new PHP spec leaves a lot of __destruct() behavior undefined! [17:16:34] <^demon|away> HHVM and PHP5 differ on this in several ways. 
[17:16:58] ^demon|away: i think they left it undefined because the hhvm team disagree with php about that [17:17:11] <^demon|away> Well yeah [17:17:18] MaxSem: bblack: K, yeah to really fix the bug it should to work for anons (even though they wouldn't be the main target audience in this case) [17:17:27] HHVM differs from HHVM on several things in several ways. But luckily there's an API/ABI version to track: the git commit id for the HHVM branch you're running :) [17:17:28] Reedy: https://github.com/wikimedia/mediawiki-extensions-WikimediaMessages/commit/081578e8a4d3391d9c968bf9c6c6744e84ea8ba7 is the twn commit that added all the messages, so we just need that on wmf15 [17:17:56] <^demon|away> bblack: Who needs a version number when you've got a git sha1! [17:18:03] So if sessions are definitely unreliable for anons, I'll just move the filter info to URL params and call it a day [17:18:11] AndyRussG, so what do you want, exactly? where is this session needed? [17:18:42] (03CR) 10Filippo Giunchedi: hhvm: provide hhvm-api-$VERSION (032 comments) [operations/debs/hhvm] - 10https://gerrit.wikimedia.org/r/150845 (owner: 10Giuseppe Lavagetto) [17:18:51] It's just to keep the info about the list filter when you page around the list [17:19:04] Not my code originally, just fixing it [17:19:15] So I don't know the original motivation here [17:19:45] Probably so you can go to any other page (not just the following list pages) and come back to the list and get your previous filter value [17:19:54] on the banner special page? [17:20:01] Yep [17:20:15] can this be done a) client-side or b) with cookies? 
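Note on AndyRussG's plan above to move the filter info to URL params: a minimal sketch of carrying a list filter through paging links instead of the session might look like the following. The parameter names `filter` and `offset` and both function names are assumptions made for illustration; they are not taken from the CentralNotice code.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def page_link(base_url, filter_value, offset):
    """Build a paging link that carries the filter in the query string."""
    query = {"offset": offset}
    if filter_value:
        query["filter"] = filter_value  # hypothetical parameter name
    return base_url + "?" + urlencode(query)


def read_filter(url):
    """Recover the filter from a request URL; empty string if absent."""
    params = parse_qs(urlsplit(url).query)
    return params.get("filter", [""])[0]


link = page_link("https://example.org/wiki/Special:CentralNoticeBanners", "B14", 20)
```

Because the filter travels with every link, it works the same for logged-in and anonymous users and does not depend on session storage at all.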
[17:20:45] (03CR) 10Andrew Bogott: [C: 032] Rename pdf-render-admins to ocg-render-admins [operations/puppet] - 10https://gerrit.wikimedia.org/r/150863 (owner: 10Andrew Bogott) [17:22:07] Hmm I guess it could be stored client-side, though the server needs to get the info to generate the list with a "like" claus [17:22:21] There's also another bug requesting precisely that the filter info be put in the URLs [17:22:35] https://bugzilla.wikimedia.org/show_bug.cgi?id=53753 [17:22:40] (03PS2) 10Reedy: testwiki to 1.24wmf16 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150859 [17:22:46] (03CR) 10Reedy: [C: 032] testwiki to 1.24wmf16 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150859 (owner: 10Reedy) [17:22:50] (03Merged) 10jenkins-bot: testwiki to 1.24wmf16 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150859 (owner: 10Reedy) [17:23:19] (03PS3) 10Andrew Bogott: access : add cscott to pdf admin [operations/puppet] - 10https://gerrit.wikimedia.org/r/150521 (owner: 10Matanya) [17:23:20] So moving it to the URLs would solve both bugs [17:23:21] (03PS6) 10Andrew Bogott: Disable access for mwalker, who is leaving WMF. [operations/puppet] - 10https://gerrit.wikimedia.org/r/150263 [17:23:33] <^demon|away> AndyRussG: Everybody wins! [17:23:47] !log reedy Started scap: testwiki to 1.24wmf16 and build l10n cache [17:23:53] Logged the message, Master [17:24:37] MaxSem: [17:24:48] (oops) [17:24:48] aha [17:25:09] (03CR) 10Andrew Bogott: [C: 032] access : add cscott to pdf admin [operations/puppet] - 10https://gerrit.wikimedia.org/r/150521 (owner: 10Matanya) [17:26:40] ^demon|away: MaxSem: yeah so with continuing to use $_SESSION not an option, I don't see much reason not to pursue the URL param route first... [17:27:45] Also, agreed, PHP _was_ a bad idea... 
oh well, so it goes [17:31:19] (03PS1) 10Aaron Schulz: Increased "htmlCacheUpdate" throttle limit [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150866 [17:32:27] (03CR) 10Aaron Schulz: [C: 032] Increased "htmlCacheUpdate" throttle limit [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150866 (owner: 10Aaron Schulz) [17:32:31] (03Merged) 10jenkins-bot: Increased "htmlCacheUpdate" throttle limit [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150866 (owner: 10Aaron Schulz) [17:34:46] PROBLEM - puppet last run on ocg1003 is CRITICAL: CRITICAL: Puppet has 1 failures [17:38:06] PROBLEM - puppet last run on ocg1001 is CRITICAL: CRITICAL: Puppet has 1 failures [17:38:49] RECOVERY - puppet last run on ocg1003 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [17:40:07] RECOVERY - puppet last run on ocg1001 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [17:40:46] (03PS1) 10coren: labstore1003: Re-return to Precise [operations/puppet] - 10https://gerrit.wikimedia.org/r/150868 [17:40:58] andrewbogott: With much sadness, ^^ [17:41:13] 'round and 'round she goes! [17:41:35] (03CR) 10Andrew Bogott: [C: 032] labstore1003: Re-return to Precise [operations/puppet] - 10https://gerrit.wikimedia.org/r/150868 (owner: 10coren) [17:42:06] Coren: merged on palladium… can I leave Carbon to you? [17:42:14] Yup. 
Will do [17:46:23] !log reedy Finished scap: testwiki to 1.24wmf16 and build l10n cache (duration: 22m 35s) [17:46:28] Logged the message, Master [17:47:50] !log aaron Synchronized wmf-config/CommonSettings.php: Increased "htmlCacheUpdate" throttle limit (duration: 00m 07s) [17:47:56] Logged the message, Master [17:53:14] (03PS2) 10Reedy: Wikipedias to 1.24wmf15 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150860 [17:59:49] (03PS1) 10Alexandros Kosiaris: Add instructions for building from trunk [operations/debs/kafka] - 10https://gerrit.wikimedia.org/r/150870 [18:00:05] Reedy, greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140731T1800). Please do the needful. [18:00:45] YOU CAN'T BE SERIOUS [18:01:18] ? [18:01:22] (03CR) 10Reedy: [C: 032] Wikipedias to 1.24wmf15 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150860 (owner: 10Reedy) [18:01:26] (03Merged) 10jenkins-bot: Wikipedias to 1.24wmf15 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150860 (owner: 10Reedy) [18:01:32] (03CR) 10Ori.livneh: [C: 031] "I love good READMEs. +1 for style / spelling etc." [operations/debs/kafka] - 10https://gerrit.wikimedia.org/r/150870 (owner: 10Alexandros Kosiaris) [18:01:56] Reedy: ? :) [18:02:27] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.24wmf15 [18:02:32] Logged the message, Master [18:03:54] 1 Warning: Missing argument 2 for ApiBase::dieUsage(), called in /usr/local/apache/common-local/php-1.24wmf15/extensions/GettingStarted/api/ApiGettingStartedGetP [18:03:54] ages.php on line 54 and defined in /usr/local/apache/common-local/php-1.24wmf15/includes/api/ApiBase.php on line 1368 [18:04:33] Who's incharge of GettingStarted? 
[18:05:20] * Reedy files a bug [18:05:46] superm401, phuedx, rmoen [18:05:59] it's very +easy [18:06:07] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:06:29] (03PS2) 10Reedy: group0 to 1.24wmf16 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150861 [18:07:27] (03CR) 10Reedy: [C: 032] group0 to 1.24wmf16 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150861 (owner: 10Reedy) [18:07:46] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC [18:07:48] (03Merged) 10jenkins-bot: group0 to 1.24wmf16 [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150861 (owner: 10Reedy) [18:09:17] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.24wmf16 [18:09:23] Logged the message, Master [18:10:34] greg-g: mwalker says that i should poke you and ask if i can pretty please deploy some fixes to the ocg service soonish. ;) [18:10:51] (03PS1) 10Ori.livneh: mediawiki: use HHVM module on trusty [operations/puppet] - 10https://gerrit.wikimedia.org/r/150873 [18:11:07] PROBLEM - puppet last run on labsdb1005 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:12:26] cscott: sure, at 1pm pacific? [18:12:54] ie in 48min? [18:13:02] 1 hour and 48 minutes [18:13:05] (03PS1) 10BBlack: fix labs-hosts1-a-eqiad netboot [operations/puppet] - 10https://gerrit.wikimedia.org/r/150874 [18:13:22] (03CR) 10Ori.livneh: mediawiki: use HHVM module on trusty (032 comments) [operations/puppet] - 10https://gerrit.wikimedia.org/r/150873 (owner: 10Ori.livneh) [18:13:29] (03CR) 10BBlack: [C: 032 V: 032] fix labs-hosts1-a-eqiad netboot [operations/puppet] - 10https://gerrit.wikimedia.org/r/150874 (owner: 10BBlack) [18:14:13] greg-g: sure. earlier might be better if at all possible, trying to make the most of mwalker's time. [18:14:25] but i can do 4pm eastern. 
[18:14:57] cscott: as soon as Reedy is done with the big train deploy today and everything is quiet, really, check back in 30 minutes? [18:15:07] greg-g: ok, works for me, thanks. [18:15:26] (03PS2) 10Reedy: Remove MediaViewer survey-related settings [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146951 (owner: 10Gergő Tisza) [18:15:31] (03CR) 10Reedy: [C: 032] Remove MediaViewer survey-related settings [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146951 (owner: 10Gergő Tisza) [18:15:41] (03Merged) 10jenkins-bot: Remove MediaViewer survey-related settings [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/146951 (owner: 10Gergő Tisza) [18:18:27] (03PS4) 10Reedy: UploadWizard config: Add PD-old-70-1923 license [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145375 (owner: 10Rillke) [18:18:34] (03CR) 10Reedy: [C: 032] UploadWizard config: Add PD-old-70-1923 license [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145375 (owner: 10Rillke) [18:18:39] (03Merged) 10jenkins-bot: UploadWizard config: Add PD-old-70-1923 license [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/145375 (owner: 10Rillke) [18:20:20] _joe_: puppet compiler is borked because puppet-compiler02.eqiad.wmflabs is rejecting jenkins-deploy's SSH key: https://integration.wikimedia.org/ci/computer/puppet-compiler02.eqiad.wmflabs/log [18:20:44] should I file an RT ticket or a bugzilla bug? [18:22:52] (03PS1) 10coren: Revert "labstore1003: Give up on Trusty for the time being" [operations/puppet] - 10https://gerrit.wikimedia.org/r/150875 [18:24:34] andrewbogott: ^^ [18:25:09] (03CR) 10Andrew Bogott: [C: 032] Revert "labstore1003: Give up on Trusty for the time being" [operations/puppet] - 10https://gerrit.wikimedia.org/r/150875 (owner: 10coren) [18:25:26] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There are 2 unmerged changes in mediawiki_config (dir /a/common/). 
[18:25:40] oooo [18:25:49] It's like a bouncy castle, for patches! [18:26:08] (03PS2) 10Reedy: Remove deprecated $wgCopyrightIcon [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/148568 (owner: 10Withoutaname) [18:26:12] (03CR) 10Reedy: Remove deprecated $wgCopyrightIcon (031 comment) [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/148568 (owner: 10Withoutaname) [18:26:20] (03CR) 10Reedy: [C: 032] Remove deprecated $wgCopyrightIcon [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/148568 (owner: 10Withoutaname) [18:26:29] (03Merged) 10jenkins-bot: Remove deprecated $wgCopyrightIcon [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/148568 (owner: 10Withoutaname) [18:27:44] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 15s) [18:27:49] Logged the message, Master [18:28:16] greg-g: I think I'm just about done... [18:28:26] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [18:28:28] Though just noticed a common fatal in Translate in wmf16... 
[18:28:30] 19 Fatal error: Call to a member function add() on a non-object in /usr/local/apache/common-local/php-1.24wmf16/extensions/Translate/utils/RcFilter.php on line 33 [18:30:06] :/ [18:31:51] Though I think it's really a core bug [18:32:27] (03PS1) 10Ori.livneh: hhvm: specify type => rvalue for php_ini() [operations/puppet] - 10https://gerrit.wikimedia.org/r/150877 [18:32:39] New code [18:32:46] (03CR) 10Ori.livneh: [C: 032 V: 032] hhvm: specify type => rvalue for php_ini() [operations/puppet] - 10https://gerrit.wikimedia.org/r/150877 (owner: 10Ori.livneh) [18:33:31] https://github.com/wikimedia/mediawiki-core/commit/08fee4ce2f1bbf362c8422a3e306a905c8775094 [18:33:53] (03PS10) 10Withoutaname: Reduce string URLs to defined constant [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/131914 (https://bugzilla.wikimedia.org/48618) [18:39:06] Reedy, the wikimedia project button on every page is broken [18:39:22] lol [18:39:33] I suspect that's 148568 [18:39:35] I'll revert [18:40:01] (03PS1) 10Reedy: Revert "Remove deprecated $wgCopyrightIcon" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150878 [18:40:09] (03CR) 10Reedy: [C: 032] Revert "Remove deprecated $wgCopyrightIcon" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150878 (owner: 10Reedy) [18:40:14] (03Merged) 10jenkins-bot: Revert "Remove deprecated $wgCopyrightIcon" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150878 (owner: 10Reedy) [18:40:40] MaxSem: thanks [18:40:51] !log reedy Synchronized wmf-config/: (no message) (duration: 00m 15s) [18:41:19] (03PS1) 10Ori.livneh: HHVM: fix syntax of php_ini.rb and ini parameter name [operations/puppet] - 10https://gerrit.wikimedia.org/r/150879 [18:41:49] (03CR) 10Ori.livneh: [C: 032 V: 032] "trivial; tested on beta" [operations/puppet] - 10https://gerrit.wikimedia.org/r/150879 (owner: 10Ori.livneh) [18:55:32] (03PS2) 10Ori.livneh: mediawiki: use HHVM module on trusty [operations/puppet] - 
10https://gerrit.wikimedia.org/r/150873 [18:58:48] (03CR) 10Kaldari: [C: 031] Add Uploadrestriction for Commons in MF [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150145 (https://bugzilla.wikimedia.org/62598) (owner: 10Florianschmidtwelzow) [19:00:27] (03CR) 10Ori.livneh: "cherry-picked on beta app servers; does the right thing." [operations/puppet] - 10https://gerrit.wikimedia.org/r/150873 (owner: 10Ori.livneh) [19:01:57] <_joe_> ori: let me take a look at that tomorrow, please [19:02:07] _joe_: yep, was not going to merge it :) [19:02:07] <_joe_> but it generally seeems great :) [19:02:18] <_joe_> thanks again [19:02:22] thank you! [19:08:31] (03PS1) 10Ottomata: Separate kafka-mirror out into its own package [operations/debs/kafka] - 10https://gerrit.wikimedia.org/r/150883 [19:11:04] greg-g: how's the train going? [19:11:39] mwalker and i were having fun over in #mediawiki-pdfhack with fsync(2) and NFS. [19:12:19] Reedy: all good? [19:17:34] greg-g: I think so [19:17:45] cscott: go froth [19:17:48] or forth [19:17:57] but if you want a cappicino that's cool too [19:18:10] mmm, capucchino [19:22:36] RECOVERY - Disk space on labsdb1005 is OK: DISK OK [19:23:59] * jeremyb waves HaeB [19:24:37] !log labsdb1005 had to blow away the postgres slave: was using all the space on / because DB at wrong spot (should have been /srv/postgres) [19:24:44] Logged the message, Master [19:25:00] csteipp: hey! OAuth is fataling on wikitech, do you think you'll have a moment to help? [19:25:41] what's wrong with it? [19:25:50] Reedy: see -labs [19:26:00] PHP Fatal error: Call to a member function isSpecial() on a non-object in /srv/org/wikimedia/controller/wikis/slot0/extensions/OAuth/api/MWOAuthAPI.setup.php on line 55 [19:27:02] ofc $context->getTitle() does say @return Title|null [19:27:25] Reedy: hmm, should there be a check in OAuth then? 
[19:31:35] HaeB: ping [19:40:31] (03CR) 10Ottomata: [C: 032 V: 032] Add instructions for building from trunk [operations/debs/kafka] - 10https://gerrit.wikimedia.org/r/150870 (owner: 10Alexandros Kosiaris) [19:40:44] !log Jenkins build its first hhvm extension \O/ https://integration.wikimedia.org/ci/job/php-FastStringSearch-hhvm-build/2/console [19:40:49] Logged the message, Master [19:41:36] heya, anybody avail for a quick little monitoring brain bounce? [19:41:54] i need to get data out of a hive table, into ganglia and icinga [19:42:01] the data is updated hourly [19:42:08] (03PS1) 10Jgreen: make exim-to-gmetric only log to syslog on failure to stop all the cronspam [operations/puppet] - 10https://gerrit.wikimedia.org/r/150896 [19:42:08] hashar: :) [19:42:29] Reedy: that is more or less like the first step on the moon [19:42:34] does anyone have monitoring already set up somewhere based on data in a mysql table? [19:42:43] Reedy: ridiculously easy to achieve but lot of preliminary work involved ! [19:42:49] springle: know of anything like that? [19:43:22] ottomata: i think we have such a thing in fundraising [19:43:52] i mean pulling data from a table and shoving it into ganglia/nagios [19:43:55] yup [19:43:59] exactly, [19:44:06] it'll be perl though :-) [19:44:08] is that in puppet somewhere? [19:44:24] that's (maybe) ok, i just want to see an example of how it has been done before I make a decision on how to do it [19:44:34] in frack puppet. i can put it somewhere for you. lemme see what we've got [19:45:19] ottomata: crazy idea: push the data to graphite and use the check_graphite pluginn [19:45:45] yeah, woudl do the same for ganglia / check_ganglia [19:45:56] check_graphite is surely better...but all of the hadoop related data is already in ganglia... [19:48:05] ottomata: interestingly...the script I was thinking about is currently AWOL [19:48:16] finding it... 
[19:49:05] (03CR) 10Jgreen: [C: 032 V: 031] make exim-to-gmetric only log to syslog on failure to stop all the cronspam [operations/puppet] - 10https://gerrit.wikimedia.org/r/150896 (owner: 10Jgreen) [19:49:36] greg-g: ok, about to deploy ocg in earnest. [19:50:07] finished the pre-deploy checks [19:50:34] Jeff_Green: http://exchange.nagios.org/directory/Plugins/Databases/check_mysql_query-2Epl-(Advanced-Nagios-Plugins-Collection)/details ? [19:51:52] ottomata: I looked at that and decided not to use it for some reason I no longer remember [19:52:34] found it [19:54:36] see bast1001:~/check_fundraising_jobs [19:55:31] ok cool, thanks [19:55:52] I wrote it for use in a passive check based on the standard perl plugin example, I don't know of any reason it would not work with nrpe but I also haven't tried it [19:56:12] (03PS1) 10Aaron Schulz: Use the new "dispatcher" config format and use curl with HHVM [operations/puppet] - 10https://gerrit.wikimedia.org/r/150900 [19:56:41] oh well one thing--all the thresholds are build in. you'd probably want to convert those to arguments [19:56:46] hm, ok, get mysql query results, and report to graphite [19:56:46] ja [19:56:47] hm [19:56:49] sorry [19:56:51] to icinga* [19:56:58] how often does this run? [19:57:01] yeah, pretty straightfoward [19:57:08] whenever icinga decides to run it? [19:57:24] ottomata: in frack everything is polled on a 5 minute cron, and reports by passive checks (nsca) [19:57:35] ok... [19:57:36] hm [19:57:48] tim's nrpe review convinced me not to use it [19:58:15] well, i don't need nrpe, doesn't matter, this query could be run from icinga host just fine [19:58:21] its just querying a hive table [19:58:21] hm [19:58:32] (03CR) 10Aaron Schulz: "Requires https://gerrit.wikimedia.org/r/#/c/149923/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/150900 (owner: 10Aaron Schulz) [19:58:34] my problem is that this data will only change every hour [19:58:39] and, it will be about 2 hours behind... 
[19:58:39] hmmm [19:59:08] maybe mysql is the wrong thing to check then? [19:59:13] well, the data is in a hive table [19:59:16] and generated hourly [19:59:30] i'm checking if there is missing data imported into hdfs [19:59:40] and we only import and check on hourly levels [19:59:50] (03PS1) 10RobH: blog cname migration to wordpress [operations/dns] - 10https://gerrit.wikimedia.org/r/150901 [19:59:52] and if there's missing data, what would the action be? [19:59:57] icinga alert [20:00:01] after that [20:00:07] after that? go figure out why [20:00:13] (03CR) 10RobH: [C: 04-1] "no one should submit this but me, as its required to be timed with the migration" [operations/dns] - 10https://gerrit.wikimedia.org/r/150901 (owner: 10RobH) [20:00:13] look at which hosts are actually reporting missing data: [20:00:18] are they all esams hosts? [20:00:19] etc. [20:00:40] there could also be extra (duplicate) data [20:00:46] we saw that the other day when the esams link had a blip [20:00:52] what process writes the table? [20:00:59] is that itself a script? [20:01:04] uhhh, a hive query / oozie :) [20:01:14] what I'm getting at.... [20:01:23] https://github.com/wikimedia/analytics-refinery/blob/master/oozie/webrequest/partition/add/generate_sequence_statistics.hql [20:01:38] you could do it as a passive check on an hourly interval, pretty sure nagios would support that [20:01:48] so you'd run the check just after the job completes [20:02:05] passive checks warn/alert too [20:02:15] they're just driven from the client side instead of the master side [20:02:58] k reading more... 
[20:03:34] i have a separate script that runs all the plugins and submits passive checks [20:04:04] that's very straightfoward, there's a command line you call with arguments for each check you want to report [20:04:15] twenty checks means you call the command 20x [20:07:01] ok, i *think* i can get away without having to do the ncsa part, [20:07:10] as long as I can query hive from neon, which I'm sure we set up [20:07:19] right? [20:07:38] probably [20:07:58] ok, Jeff_Green, since I've got your attention, here's a more best way to do this question [20:08:00] so. [20:08:06] the data is in the format [20:08:06] afaik nrpe just calls the plugins remotely [20:08:13] hostname percent_different [20:08:19] (there are more fields, but those are the important ones) [20:08:23] ok [20:08:27] if percent_different = 0.0, then everything is cool [20:08:40] i could: create a service check for each host [20:08:44] that would be every varnish host [20:08:46] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC [20:08:50] OR [20:09:01] i could just count the number of records per hour with percent_different != 0 [20:09:09] and just have one service check (not associated with a particular host) [20:09:16] right [20:09:20] so.... [20:09:42] i'd have to review nrpe vs nsca in terms of reporting and host association [20:09:47] but... [20:10:24] if you do an all-host report, you're very limited in terms of how much data you can shove into icinga [20:10:41] ? [20:10:58] look at icinga for host db1025 [20:11:18] there's only so much data you can shove into Status Information [20:11:24] oh, right [20:11:24] yeah [20:11:26] not sure what the limit is [20:11:29] if it was an aggregate check [20:11:34] the status would just say "somethign is wrong!" 
[20:11:41] that's all I could tell [20:11:44] maybe I could make 2 checks [20:11:51] one for extra data, one for missing data [20:11:55] > 0.0 and < 0.0 [20:11:56] or you could even do "FAIL host.wm.o is 39% different" [20:12:09] just something to consider though [20:12:11] well, usually, if there is missing data, its missing on a lot of hosts at once [20:12:17] k [20:12:21] probably would be better to associate the missing data with the host though? [20:12:22] not sure [20:12:36] its weird though, this would not mean that there is anything wrong on the actual varnish node [20:12:45] I'm not sure if icinga supports reports that have no host association [20:12:47] there might be something weird there, but more likely it is a networking problem, or somethign wonky on the hadoop/kafka end [20:12:52] oh, it doesn't [20:12:56] i started looking into that the other day [20:12:56] if I did an aggregate check [20:12:58] you're sure? [20:13:02] not sure [20:13:04] just assume it doesn't [20:13:10] since all of the service checks require a hostname [20:13:19] but, i'd just associate it with some host somewhere [20:13:22] analytics1010 maybe [20:13:27] analytics1027 maybe [20:13:28] yeah [20:13:38] its weird either way [20:13:39] that's what we've done in other cases [20:13:46] !log updated OCG to version d2919c59eb09e09fc87777696411a070620aef45 [20:13:48] missing data in hdfs doesn't mean that there a problem on a varnish node, so ja [20:13:51] Logged the message, Master [20:14:07] also, when this happens [20:14:14] it usually happens to a good number of hosts all at once [20:14:34] usually either related to a cache type (upload, text, etc.) or a datacenter network problem (esams, etc.) [20:14:43] yeah [20:14:44] so if there is missing or duplicate data, they come in batches of nodes [20:14:53] so, if I did a per-host check [20:14:56] that would be a lot of alerts! [20:15:00] indeed [20:15:09] and it's not as though you run around to every host and fix it, right? 
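A minimal sketch of the aggregate variant discussed above: one service check that summarizes every (hostname, percent_different) row for the hour instead of one check per varnish host. The function name, the example host names, and the sign convention (negative means missing data, positive means extra or duplicate data) are assumptions for illustration, not something taken from the actual Hive table. The return codes follow the usual Nagios/Icinga plugin convention of 0 for OK and 2 for CRITICAL.

```python
OK, CRITICAL = 0, 2  # conventional Nagios/Icinga plugin exit codes


def aggregate_status(rows, max_listed=3):
    """Summarize (hostname, percent_different) rows into one check result.

    Assumption: percent_different < 0.0 means missing data and
    percent_different > 0.0 means extra (duplicate) data.
    Returns a (exit_code, status_line) tuple.
    """
    rows = list(rows)
    missing = [(host, pct) for host, pct in rows if pct < 0.0]
    extra = [(host, pct) for host, pct in rows if pct > 0.0]

    if not missing and not extra:
        return OK, "OK: %d hosts, no missing or duplicate data" % len(rows)

    # Status Information is short, so list only the worst offenders.
    worst = sorted(missing + extra, key=lambda hp: abs(hp[1]), reverse=True)
    listed = ", ".join("%s %+.1f%%" % hp for hp in worst[:max_listed])
    return CRITICAL, (
        "CRITICAL: %d hosts missing data, %d hosts with extra data (%s)"
        % (len(missing), len(extra), listed)
    )
```

A wrapper script would print the status line and exit with the returned code, and the result would be attached to some arbitrary host, since a service check needs a hostname.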
[20:15:16] it's a central thing that you're fixing [20:15:27] usually, yeah, very rarely is it a host problem [20:15:32] if it is, it would be maybe just one node [20:15:39] like, varnishkafka stopped [20:15:40] or something [20:15:44] but we have monitoring on that already [20:16:27] ok cool, thanks for the brain bounce, i'm going to look into doing an aggregate passive check then :) [20:16:57] hmmmmm, i dunno though [20:16:58] hmm [20:17:03] ottomata: look into doing it with nrpe, it's probably feasible [20:17:06] actually, maybe this is weird for icinga [20:17:09] because [20:17:11] let's say [20:17:14] hour 1 has missing data [20:17:15] andrewbogott: sorry, was away, i see it is solved [20:17:18] hour 2 has missing data [20:17:20] cool, we get an alert [20:17:24] then, hour 3 comes along [20:17:26] no missing data [20:17:29] the alert will fix itself [20:17:42] we should still go back and look into why there was missing data [20:17:43] that may be configurable [20:17:57] you could also make your plugin drop a state file [20:18:06] depending on the type of loss (especially duplicates), we can fix data [20:18:09] in the past [20:18:12] i.e. make your plugin stay in fail state until you do something corrective [20:18:17] hm [20:18:46] icinga has a lot of knobs around alerting [20:18:48] i dunno though, maybe this would be better as an emailed report or something [20:19:06] like, here's a list of all hosts missing data this day, etc. or in the past week, or month [20:19:13] yeah [20:19:37] you could make it email a detailed report to normal email destinations, and a short alert to SMS email gateways [20:20:14] i think we won't SMS these alerts, they don't need to be responded to that quickly [20:20:20] but ja [20:20:27] normal email destinations? [20:20:52] i.e.
a normal email account, as opposed to an SMS gateway [20:21:30] ah ah, yeah [20:22:11] !log Started populateBacklinkNamespace.php on s1-s3,s5-s7 (commons already running) [20:22:16] Logged the message, Master [20:27:14] qchris: ja so, i'm thinking that using icinga for this isn't ideal, and that an emailed report would be better [20:27:39] we can make icinga do this, but, seeing as missing data is usually only for an hour or two here and there [20:28:25] the icinga alerts will fix themselves probably before we get to them [20:28:26] hm [20:28:28] i dunno, i'm conflicted [20:28:32] that's how udp2log is now... [20:28:50] but ja, the alert would be very simple, and just say "there is missing data in hdfs for the current hour!" [20:28:52] or something like that [20:29:03] an emailed report would give us basically the same info that is in your faulty hosts files [20:29:04] ottomata: an approach I use a lot for stuff like this is structured messages [20:29:09] email messages i mean [20:29:16] ? [20:29:17] all from the same sender, all with a standard subject line [20:29:20] aye [20:29:21] yeah [20:29:32] qchris: has actually done this for udp2log webrequest files before [20:29:39] so I can look in my inbox and scroll down for the ones that start Subect: FAILURE [20:29:43] yeah [20:30:03] i get about 20 of those per day from the fundraising cluster for normal automated maintenance stuff [20:30:20] we probably wouldn't even need to send the email if there wasn't anything wrong [20:30:30] * qchris is still catching up on scrollback [20:30:54] ottomata: I do it so I know jobs have been running [20:31:04] qchris: you probalby don't need anything before :07 this hour [20:31:27] ori: do you remember the ipython notebooks talking to EL data in prod that you had like, years ago? [20:31:27] aye [20:32:04] ori: quarry.wmflabs.org isn't as awesome, but somewhat there :) [20:32:47] ottomata: Email reports would be fine by me. But using Icinga to get them out would be nice. 
This would buy us the "single monitoring platform" benefits and everything. [20:33:18] ottomata: Also ... no need to do a hive query ... It should suffice to check for the presence of the done marker in hdfs. [20:33:38] If a data set has no done marker, then there is something wrong. [20:33:41] for a really simple check, that's true [20:33:42] hm [20:33:52] or even, i could parse the faulty hosts file :p [20:33:53] but ja [20:34:07] qchris, i think we should get oozie to email us the faulty hosts file too, pretty sure we can do that [20:34:20] We can get oozie to do that. [20:34:32] But failure handling within oozie quickly gets nasty. [20:34:46] hmm, ok so you think I should just make a really simple icinga check each hour for the existence of the _SUCCESS file? [20:34:50] Like "what happens if the kill state fails" [20:34:58] ottomata: Yes. Totally. [20:35:16] Jeff_G reen suggested an additional status file, so [20:35:16] ok, i will work on that first, if I figure out the passive check thing, should be really easy [20:35:33] yeah, to indicate that something is ok again? [20:35:34] Icinga does not go back to OK when the following hour was good. That [20:35:40] Right. [20:35:49] But that's already bonus. [20:35:55] right, but how would we do that? make sure every hour has _SUCCESS files? [20:35:57] ottomata: just occurred to me--you could keep your alerting state in the db too [20:36:03] since you already have the db [20:36:06] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 314 seconds [20:36:07] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 315 seconds [20:36:19] Jeff_Green: totally possible, but hive is kind of cumbersome [20:36:22] very high latency [20:36:31] oic [20:36:54] it would take 20ish seconds to get that result back :p [20:37:08] but, qchris, is that what we want [20:37:25] an alert if there is any missing data at all? i.e. some directories without _SUCCESS files?
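The simple done-marker check proposed above could look roughly like the following sketch. It looks at the hourly partition from two hours back (the data is said earlier to lag by about two hours) and reports CRITICAL if no _SUCCESS file is present. The base path and the year/month/day/hour directory layout are hypothetical placeholders, not the real HDFS layout; the only external command assumed is `hdfs dfs -test -e`, which exits 0 when the path exists.

```python
import subprocess
from datetime import datetime, timedelta

OK, CRITICAL = 0, 2  # conventional Nagios/Icinga plugin exit codes

# Hypothetical base directory; the real HDFS layout is not given in the log.
BASE = "/path/to/webrequest"


def partition_path(now, base=BASE, hours_back=2):
    """Directory of the hourly partition `hours_back` hours before `now`.

    The year/month/day/hour layout is an assumption for illustration.
    """
    t = now - timedelta(hours=hours_back)
    return "%s/%04d/%02d/%02d/%02d" % (base, t.year, t.month, t.day, t.hour)


def hdfs_exists(path):
    """True if `path` exists in HDFS (`hdfs dfs -test -e` exits 0)."""
    return subprocess.call(["hdfs", "dfs", "-test", "-e", path]) == 0


def check_success_marker(now, exists=hdfs_exists):
    """Return (exit_code, status_line) for the partition's _SUCCESS marker."""
    marker = partition_path(now) + "/_SUCCESS"
    if exists(marker):
        return OK, "OK: found %s" % marker
    return CRITICAL, "CRITICAL: no _SUCCESS marker at %s" % marker
```

Run hourly, a wrapper would call `check_success_marker(datetime.utcnow())`, print the status line, and exit with the code. Because each run only looks at one hour, the state flaps back to OK as soon as a later hour is complete; keeping it in a failed state would need something extra, such as the state file suggested earlier.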
[20:37:44] as much as i don't like it, we will have missing data, and i doubt we will be able to do something about it sometimes [20:37:53] YuviPanda: checking it out! [20:38:03] Sure, we'll have missing data. [20:38:17] And sure, there'll be datasets that we cannot fix. [20:38:57] But "Icinga sending us an email if the dataset from two hours back does not have a _SUCCESS file" [20:39:06] is good enough for a start. [20:39:26] No need to require that /all/ datasets have a _SUCCESS file. [20:39:29] ok ok [20:39:35] so, the state can flap then [20:39:55] ok, the state file that Jeff_Green suggested would keep the alert state until we fixed it [20:40:02] that's basically how udp2log alerts now anyway [20:40:09] Sounds like it'll be simpler if it can flap. So let's have it flap. [20:40:10] packet loss for some period -> we get an email alert [20:40:12] ok cool [20:40:14] ori: heh, I think I managed to break it just as you were looking at it :) [20:40:15] ok. [20:42:05] PROBLEM - Puppet freshness on db1019 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 20:38:37 UTC [20:44:05] PROBLEM - Puppet freshness on db1019 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 20:38:37 UTC [20:46:05] PROBLEM - Puppet freshness on db1019 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 20:38:37 UTC [20:46:17] ori: hmm, mostly works, but my static serving seems to mess up now and then. 
might just be my connection :) let me know how it goes for you [20:48:05] PROBLEM - Puppet freshness on db1019 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 20:38:37 UTC [20:50:05] PROBLEM - Puppet freshness on db1019 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 20:38:37 UTC [20:52:05] PROBLEM - Puppet freshness on db1019 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 20:38:37 UTC [20:54:05] PROBLEM - Puppet freshness on db1019 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 20:38:37 UTC [20:56:05] PROBLEM - Puppet freshness on db1019 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 20:38:37 UTC [20:58:05] PROBLEM - Puppet freshness on db1019 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 20:38:37 UTC [20:59:15] RECOVERY - Puppet freshness on db1019 is OK: puppet ran at Thu Jul 31 20:59:11 UTC 2014 [21:00:16] PROBLEM - MySQL Slave Delay on db74 is CRITICAL: CRIT replication delay 306 seconds [21:00:45] PROBLEM - MySQL Replication Heartbeat on db74 is CRITICAL: CRIT replication delay 315 seconds [21:06:55] (03PS2) 10Aaron Schulz: Use the new "dispatcher" config format and use curl with HHVM [operations/puppet] - 10https://gerrit.wikimedia.org/r/150900 [21:13:39] HaeB: Ok, I can push the dns change now, just puttin gin here instead of PM since I'll also admin log it [21:14:01] (03PS3) 10Aaron Schulz: Use the new "dispatcher" config format and use curl with HHVM [operations/puppet] - 10https://gerrit.wikimedia.org/r/150900 [21:14:29] HaeB: but my blog post is locked for edit while you are on it, are you pushing it iwth timestamp? 
[21:15:51] RobH: you can just take over (that's wordpress' way of managing edit conflicts) [21:16:47] (03CR) 10RobH: [C: 032] blog cname migration to wordpress [operations/dns] - 10https://gerrit.wikimedia.org/r/150901 (owner: 10RobH) [21:17:18] !log blog.wikimedia.org cname changed to migrate over to wp servers [21:17:24] Logged the message, RobH [21:17:41] HaeB: Ok, dns change pushed and is now live, i also updated the corrected blog posting (old server) with the timestamp [21:25:24] and with that, blogs are migrated [21:25:24] woot woot [21:25:24] * RobH puts in ticket to reclaim his potential new blog server, and note to not reclaim holmium for a month [21:26:43] YuviPanda: Ah, missed your ping. I'll get on that. Been meaning to fix that... [21:26:52] csteipp: oh, is that a known issue?} [21:27:01] csteipp: it will be AWESOME if you can, I'm blocked on it :) [21:27:38] YuviPanda: Oh, and to your question in -dev. The sub property should work for you-- that's the centralauth gu_id. [21:27:50] csteipp: ah, cool! halfak just figured that out as ewll. [21:29:14] csteipp: can you add me as reviewer to the fatal fix? [21:32:13] andrewbogott: ^ fyi, re: wikitech OAuth fatal [21:32:43] oh, there's a fix? I saw a patch scroll by, thought it was unrelated [21:33:50] andrewbogott: no fix yet, [21:33:56] andrewbogott: but he says it's fixable :) [21:34:05] hm, ok! [21:34:34] (03PS1) 10coren: Labs: switch dumps to the new server [operations/puppet] - 10https://gerrit.wikimedia.org/r/150967 [21:37:46] (03CR) 10Andrew Bogott: [C: 032] "No more disk full errors!" 
[operations/puppet] - 10https://gerrit.wikimedia.org/r/150967 (owner: 10coren) [21:42:23] PROBLEM - Puppet freshness on mw1184 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 21:38:51 UTC [21:44:23] PROBLEM - Puppet freshness on mw1184 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 21:38:51 UTC [21:46:23] PROBLEM - Puppet freshness on mw1184 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 21:38:51 UTC [21:48:23] PROBLEM - Puppet freshness on mw1184 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 21:38:51 UTC [21:50:23] PROBLEM - Puppet freshness on mw1184 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 21:38:51 UTC [21:52:23] PROBLEM - Puppet freshness on mw1184 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 21:38:51 UTC [21:54:23] PROBLEM - Puppet freshness on mw1184 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 21:38:51 UTC [21:56:23] PROBLEM - Puppet freshness on mw1184 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 21:38:51 UTC [21:58:23] PROBLEM - Puppet freshness on mw1184 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 21:38:51 UTC [21:58:33] RECOVERY - Puppet freshness on mw1184 is OK: puppet ran at Thu Jul 31 21:58:28 UTC 2014 [21:59:39] (03PS1) 10Yuvipanda: quarry: Switch to halfak's mwoauth library [operations/puppet] - 10https://gerrit.wikimedia.org/r/150970 [22:00:23] PROBLEM - Puppet freshness on mw1184 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 21:58:28 UTC [22:08:51] PROBLEM - Puppet freshness on labsdb1005 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 16:06:51 UTC [22:11:40] beta labs seems to be borked. Can't load any pages :( [22:14:19] and now it's working again [22:18:31] RECOVERY - Puppet freshness on mw1184 is OK: puppet ran at Thu Jul 31 22:18:25 UTC 2014 [22:20:36] cscott: was this what you pushed out today? 
https://gerrit.wikimedia.org/r/#/c/150865/ [22:21:47] * greg-g takes a break from keyboard for a bit [22:24:30] greg-g: yes, that was the most important bit. https://gerrit.wikimedia.org/r/#/projects/mediawiki/services/ocg-collection,dashboards/default:recent has all the changes, (the 31 july patches) [22:25:13] greg-g: i'd like to start tracking this stuff on https://wikitech.wikimedia.org/wiki/OCG/Deployments (like Parsoid/Deployments does) but it's all a bit ad hoc at the moment [22:36:13] ehm, what's the analog of puppetd -tv in puppet3? [22:36:22] puppet agent -t [22:36:39] thanks:) [22:37:00] <^demon|away> I should add an alias in my .bashrc [22:42:33] I was lost too until spagewmf preciously updated wikitech docs for vagrant :) [22:43:07] MaxSem: puppet agent --disable is the new equivalent :) [22:43:39] `alias puppetd='puppet agent'` [22:43:53] that sounds a bit... incomplete [22:43:59] rm -rf / [22:44:01] ! [22:44:19] bd808: I learnt a week ago that the --no-preserve-root isn't a thing on OS X, which ships with ancient rm [22:46:26] YuviPanda: The only nice things about OSX are wifi, power and the window manager [22:46:31] bd808: :) [22:46:38] For all else there is brew and vagrant [22:51:03] mwalker: re https://bugzilla.wikimedia.org/show_bug.cgi?id=68929, is that fix deployed? [22:51:18] uh [22:51:20] ish [22:51:31] we still get an error on some pages [22:51:34] I don't know how many [22:51:46] we fixed an issue that prevented us from having images at all [22:52:17] maybe the question your asking is: "did the patch in the bug report get deployed" and the answer to that question is "yes" [22:56:47] mwalker: yeah :) [22:57:04] so "backport to WMF" is basically that, even if you all dont do backports(?) [22:57:12] andrewbogott: wikitech is down? [22:57:20] not any more [22:57:27] look at that [22:57:37] andrewbogott: did you figure out what the craziness was? [22:57:42] nope [22:57:58] greg-g, but... 
backport to me has the meaning "make a patch for a released and supported version of code that does the same thing" [22:58:05] OCG has no released and supported versions beyond master [22:58:11] so... I have no idea what they're talking about [22:58:25] mwalker: that backport to WMF flag [22:58:35] it's code for "get this onto production" [22:58:36] andrewbogott: just rolled back OAuth? [22:58:45] which, with core is "backport to approrpiate wmfXX" [22:58:46] chrismcmahon: Something is definitely hosed on beta labs still. Half the time it won't load anything from bits. And every time I log in it wants to show me everything in German, even though my language is set to English in the prefs :P [22:58:49] (hopefully temporarily) [22:58:53] * YuviPanda leaves andrewbogott alone to debug [22:58:55] greg-g, ahh... that makes no sense at all, it should just be "deploy to production" [22:59:07] mwalker: sure, blame 1 month on the job me :) [22:59:25] damn you 1 month on the job greg-g! [22:59:38] ori: ^ [22:59:40] kaldari: they're talking about it over in #wikimedia-qa [23:00:02] btw... I claim the swat today since it's my last one [23:00:04] RoanKattouw, mwalker, ori, MaxSem, kaldari: Dear anthropoid, the time has come. Please deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140731T2300). [23:00:14] * mwalker makes slurping sounds [23:00:42] :) [23:00:56] who is jouncebot ??! 
[23:01:15] jouncebot, ping [23:01:17] (03CR) 10Mwalker: "Seems a bit harsh, but OK" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150145 (https://bugzilla.wikimedia.org/62598) (owner: 10Florianschmidtwelzow) [23:01:22] (03CR) 10Mwalker: [C: 032] Add Uploadrestriction for Commons in MF [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150145 (https://bugzilla.wikimedia.org/62598) (owner: 10Florianschmidtwelzow) [23:01:30] (03Merged) 10jenkins-bot: Add Uploadrestriction for Commons in MF [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/150145 (https://bugzilla.wikimedia.org/62598) (owner: 10Florianschmidtwelzow) [23:01:43] jeremyb, jouncebot is a bot I wrote that reads the deployment calendar and pokes people [23:01:48] it lives in labs [23:02:22] sorry, I'm in a meeting [23:02:59] i can do the swat deploy if noone else it [23:03:01] s/it/is [23:03:13] ebernhardson, noooo [23:03:14] ebernhardson: mwalker's on it, last one and all [23:03:15] it's mine! [23:03:19] ok :) [23:03:37] mwalker: i have a few last minute patches from jdlrobson and shahyar that i'm adding for you :P [23:03:41] yepyep; I hread [23:03:43] :) [23:03:43] they will appear any moment in wikitech [23:04:03] * ebernhardson hates how every time he goes to wikitech hes logged out [23:05:36] !log mwalker Synchronized wmf-config: Updating configuration for {{gerrit|150145}} (duration: 00m 05s) [23:05:42] Logged the message, Master [23:05:56] no more mobile uploads :'( [23:07:06] RoanKattouw: hey! was your git submodule issue with wikitech that there was... nothing inside the submodule? [23:07:29] RoanKattouw: no .git, nothing? since that seems to be happening now with OAuth, even after init / update [23:07:56] Only a .git file inside [23:08:14] RoanKattouw: hmm, how did you fix it? 
[23:08:18] I suspect someone should blow away the tree and do a new clone [23:08:22] Yes [23:08:27] YuviPanda: HACKHACKHACK [23:08:28] Remove the dir, do a manual clone [23:08:38] hmm [23:08:40] Which sounds horrible, but seems to BEHAVE CORRECTLY [23:08:46] Like with git submodule update and everything [23:08:49] I don't know why [23:08:49] well, submodule update won't work [23:08:50] oh [23:08:51] they do? [23:08:51] wtf [23:08:56] Yeah wtf [23:08:58] I wonder if there's a buggy version of git on virt1000 [23:09:03] RoanKattouw: earlier, a git pull didn't actually affect the contents of the files :| [23:09:11] RoanKattouw: and even rming the file didn't show up in git diff, or any modifications to it [23:09:15] RoanKattouw: but git log reported correctly [23:09:30] Yup same symptoms for me [23:10:25] mwalker: ok 4 more patches added so you have something to deploy :) [23:10:52] Reedy: RoanKattouw should really get wikitech into some sort of saner system of deployment [23:11:19] There was talk of having it deployed by scap etc [23:11:25] yeah, but composer [23:11:30] lol [23:11:35] "composer" "sane" [23:11:44] heh [23:12:05] the total more sane thing is to write every possible piece of software inhouse :P [23:12:16] I'm going to try to make some time to talk with Andrew about it at the hackathon. [23:12:22] invent your own validation library, because noones ever done that before :P [23:12:23] bd808: he's not gonna be there :( [23:12:33] * bd808 got the composer on prod RFC approved [23:12:40] YuviPanda: damn. after then I guess [23:12:45] ebernhardson: We've got version control and tagging in MW [23:12:49] WE SHOULD STORE ALL CODE ON MW PAGES [23:12:55] Just need FR [23:13:04] ebernhardson: do you know of CodeReview days when CR was done on MW? 
:) [23:13:06] Just need to fix this wiki blame thing [23:13:17] Reedy: thats it, for next weeks hackathon i'm writing a git->mediawiki importer [23:13:23] Extension:CodeReview is awesome [23:13:31] thats scary :P [23:13:58] I imported a load of diffs into mw.o a while back [23:13:59] https://www.mediawiki.org/wiki/Wikia_code [23:14:31] ebernhardson, these exists a mediawiki / git integration; but it only looks at the master of whats in git and displays it in mediawiki [23:14:41] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 19668 MB (3% inode=99%): [23:15:01] https://github.com/moy/Git-Mediawiki/wiki [23:15:08] which goes the other way too [23:15:18] ebernhardson: https://www.mediawiki.org/wiki/Special:Code/MediaWiki [23:15:27] ebernhardson: if you hadn't seen it before :) [23:15:40] its very sad when you sarcasticly suggest something, and it turns out someone actually thought it was a good idea and programmed it :P [23:15:51] ebernhardson: :D blame brion [23:15:58] * YuviPanda pats ebernhardson [23:16:45] anyways, i'm acutally working on my inter-wiki watchlist service for hackathon, but probably getting offtopic :P [23:22:07] (PS1) Yuvipanda: stats: Install php commandline packages to the crunchers [operations/puppet] - https://gerrit.wikimedia.org/r/150985 (https://bugzilla.wikimedia.org/68937) [23:23:04] !log mwalker Synchronized php-1.24wmf16: Updating core and Flow for SWAT (duration: 00m 53s) [23:23:06] ebernhardson, ^ there's the five minutes of doom for resourceloader, but its out there [23:23:08] Logged the message, Master [23:23:21] mwalker: awsome, thanks [23:24:31] PROBLEM - Disk space on vanadium is CRITICAL: DISK CRITICAL - free space: / 4275 MB (3% inode=94%): [23:31:08] greg-g, *tears* last swat deploy [23:33:55] mwalker :( [23:38:00] (CR) Ori.livneh: [C: 2] Use the new "dispatcher" config format and use curl with HHVM [operations/puppet] - https://gerrit.wikimedia.org/r/150900
(owner: Aaron Schulz) [23:40:51] PROBLEM - Puppet freshness on db1010 is CRITICAL: Last successful Puppet run was Thu 31 Jul 2014 21:40:17 UTC [23:41:41] PROBLEM - Disk space on elastic1016 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 19404 MB (3% inode=99%): [23:44:05] greg-g: Can I sneak in a late SWAT patch? [23:44:13] (I realize SWAT just ended so I can deploy it myself) [23:44:32] We made an embarrassing mistake where we renamed a class but didn't rename it in one place, so now it's crashing VE with a missing class error [23:44:38] https://gerrit.wikimedia.org/r/150989 [23:45:59] Also mwalker ---^^ any help with this (such as merging that patch) would be appreciated [23:46:17] Because, you know, obviously your last SWAT deployment must be made stressful ;) [23:47:07] reasons why I dislike javascript -- refactoring tools get confused [23:47:33] reasons why I dislike programmers -- we forget to use refactoring tools [23:47:49] reasons why I dislike javascript -- it has no compiler to tell me when I forgot to use the refactoring tool [23:48:03] RoanKattouw, I +2'd it; you're welcome to push it out [23:48:09] Thanks matt [23:48:42] Yeah we have tools to tell us about a lot of these things, but not that particular one [23:48:53] Like, jshint catches missing vars [23:49:01] Man, I like JS but I do miss static typing [23:50:59] (PS1) Ori.livneh: HHVM: set an empty string as the value for hhvm.pid_file [operations/puppet] - https://gerrit.wikimedia.org/r/150991 [23:51:01] (PS1) Ori.livneh: HHVM: warm up the JIT by making web requests in Upstart post-start [operations/puppet] - https://gerrit.wikimedia.org/r/150992 [23:52:10] (CR) TTO: "This was a legacy Wikivoyage extension.
As part of Wikivoyage becoming a WMF project, all necessary extensions were installed on the WMF c" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/150301 (https://bugzilla.wikimedia.org/68815) (owner: Reedy) [23:52:25] RoanKattouw, clearly we need to write a "hiphop for js" [23:52:31] and get a lot of seed money [23:52:36] (CR) Ori.livneh: [C: 2] HHVM: set an empty string as the value for hhvm.pid_file [operations/puppet] - https://gerrit.wikimedia.org/r/150991 (owner: Ori.livneh) [23:52:40] and then retire to aruba and never look back [23:53:01] haha [23:53:08] mwalker: http://phpjs.org/? [23:53:09] :) [23:53:13] we can be like Gerrit! [23:53:15] I have a Dutch passport, I can retire to Arbua [23:53:17] *Aruba [23:53:25] we already generate HTML with PHP, might as well generate JS :) [23:53:47] e4rr [23:53:49] I meant https://code.google.com/p/php-to-js/ [23:53:57] yurikMskRu, the first sentance on that site says "make use of sensible tools" [23:54:04] and they're referring to php [23:54:09] ack! sorry yuri [23:54:13] YuviPanda: *cough* Xml::encodeJsCall [23:55:26] ... [23:55:28] oh god [23:55:29] sigh [23:57:06] why? i think it's neat :P [23:57:19] well, the fact that it's a method on 'Xml' is a bit wtf [23:57:24] my reaction to Html:: itself is WHY. [23:57:26] * YuviPanda hugs templates [23:57:36] ori: heh :) [23:58:45] (CR) Nemo bis: "Which argument?" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/150301 (https://bugzilla.wikimedia.org/68815) (owner: Reedy) [23:58:57] ori: did you fidn time to check out quarry.wmflabs.org? :) [23:58:59] *find [23:59:20] i did, but i was nervous about giving a labs-hosted app full access to my wikimedia account [23:59:35] ori: 'full access'? [23:59:44] ori: the perms just give me ability to get your userid, groups and rights. [23:59:44] dunno, that's what the prompt asked for [23:59:51] grrr, phrasing. [23:59:53] i didn't have time to scrutinize :)