[00:05:01] PROBLEM - Disk space on labmon1001 is CRITICAL: DISK CRITICAL - free space: /srv 79836 MB (3% inode=97%):
[00:51:20] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 309 seconds
[00:52:14] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 356 seconds
[00:53:14] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -0 seconds
[00:53:36] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds
[01:19:30] PROBLEM - Disk space on logstash1002 is CRITICAL: DISK CRITICAL - free space: / 16286 MB (3% inode=99%):
[01:51:21] PROBLEM - HHVM processes on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:52:41] PROBLEM - Apache HTTP on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:52:51] PROBLEM - DPKG on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:53:00] PROBLEM - check if salt-minion is running on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:53:04] PROBLEM - nutcracker port on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:53:10] PROBLEM - check configured eth on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:53:21] PROBLEM - RAID on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:53:21] PROBLEM - HHVM rendering on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:53:21] PROBLEM - puppet last run on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:53:21] PROBLEM - Disk space on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:53:40] PROBLEM - nutcracker process on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:53:41] PROBLEM - SSH on mw1114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:53:51] PROBLEM - check if dhclient is running on mw1114 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:54:17] sigh. i'll depool it.
[01:55:15] !log depooled mw1114 after it became unresponsive, likely
[01:55:19] Logged the message, Master
[01:56:11] RECOVERY - nutcracker port on mw1114 is OK: TCP OK - 0.000 second response time on port 11212
[01:56:20] RECOVERY - check if salt-minion is running on mw1114 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[01:56:21] RECOVERY - check configured eth on mw1114 is OK: NRPE: Unable to read output
[01:56:22] RECOVERY - Disk space on mw1114 is OK: DISK OK
[01:56:30] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 9 minutes ago with 0 failures
[01:56:41] RECOVERY - nutcracker process on mw1114 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[01:56:42] RECOVERY - SSH on mw1114 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[01:56:50] RECOVERY - Apache HTTP on mw1114 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.065 second response time
[01:56:50] RECOVERY - check if dhclient is running on mw1114 is OK: PROCS OK: 0 processes with command name dhclient
[01:57:00] RECOVERY - DPKG on mw1114 is OK: All packages OK
[01:57:21] RECOVERY - RAID on mw1114 is OK: OK: no RAID installed
[01:57:21] RECOVERY - HHVM rendering on mw1114 is OK: HTTP OK: HTTP/1.1 200 OK - 72505 bytes in 0.185 second response time
[02:16:41] RECOVERY - Disk space on logstash1002 is OK: DISK OK
[03:40:50] YuviPanda: awake?
[04:01:39] PROBLEM - puppet last run on amslvs4 is CRITICAL: CRITICAL: puppet fail
[04:17:08] (PS1) Ori.livneh: hhvm: set substitute-path in gdbinit to point to source tree [puppet] - https://gerrit.wikimedia.org/r/172199
[04:20:20] RECOVERY - puppet last run on amslvs4 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[04:21:10] RECOVERY - HHVM processes on mw1114 is OK: PROCS OK: 1 process with command name hhvm
[04:24:30] PROBLEM - Disk space on vanadium is CRITICAL: DISK CRITICAL - free space: / 4274 MB (3% inode=94%):
[05:04:45] (PS1) QChris: Add jobs for aggregating hourly projectcount files to daily per wiki csvs [puppet] - https://gerrit.wikimedia.org/r/172201 (https://bugzilla.wikimedia.org/72740)
[06:27:41] PROBLEM - puppet last run on mw1054 is CRITICAL: CRITICAL: puppet fail
[06:28:00] PROBLEM - puppet last run on mw1117 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:28:20] PROBLEM - puppet last run on ms-fe2001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:10] RECOVERY - Disk space on vanadium is OK: DISK OK
[06:29:10] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:20] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:29:30] PROBLEM - puppet last run on db2036 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:31] PROBLEM - puppet last run on db1026 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:45:00] RECOVERY - puppet last run on mw1117 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:46:19] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[06:46:30] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:46:30] RECOVERY - puppet last run on ms-fe2001 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures
[06:46:49] RECOVERY - puppet last run on mw1054 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures
[06:47:39] RECOVERY - puppet last run on db2036 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[06:49:40] RECOVERY - puppet last run on db1026 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:53:09] PROBLEM - puppet last run on ssl3002 is CRITICAL: CRITICAL: Puppet has 1 failures
[07:10:39] RECOVERY - puppet last run on ssl3002 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures
[07:11:54] (PS1) Steinsplitter: Adding "*.nasa.gov" to wgCopyUploadsDomains. [mediawiki-config] - https://gerrit.wikimedia.org/r/172204
[07:29:12] (PS1) Ori.livneh: Add unit tests for pybal.ipvs.LVSService [debs/pybal] - https://gerrit.wikimedia.org/r/172206
[07:34:36] (CR) Ori.livneh: [C: +2] hhvm: set substitute-path in gdbinit to point to source tree [puppet] - https://gerrit.wikimedia.org/r/172199 (owner: Ori.livneh)
[08:16:54] springle: I assume you're aware of the db1047/db2029/dbstore1001 alerts
[08:21:12] !log force-rebooting ms-be2011, kernel "xfs stuck"
[08:21:16] Logged the message, Master
[08:21:57] root@osmium:~# puppet agent -vt
[08:21:57] Notice: Skipping run of Puppet configuration client; administratively disabled (Reason: 'reason not specified');
[08:22:00] ori: that you?
[08:22:44] yes; i'm in the middle of testing a patch
[08:22:54] i should have set a reason or logged it, sorry
[08:23:05] PROBLEM - Host ms-be2011 is DOWN: PING CRITICAL - Packet loss = 100%
[08:23:17] puppet patch or hhvm patch?
[08:23:21] hhvm
[08:23:35] and how's puppet affecting this?
[08:23:41] I'm asking because I frequently see this
[08:23:52] so perhaps we should just adjust our puppet code to allow such testing
[08:23:56] without disabling puppet runs
[08:24:33] it would clobber the upstart file, which i hacked to have a different MALLOC_CONF, and refresh the service
[08:24:57] ah
[08:25:25] not very easy to solve then, unless we stop puppetizing hhvm on osmium altogether
[08:26:23] i should just remember to !log, i don't think it's worth worrying about too much beyond that
[08:26:50] I don't like alerts piling up, but I'm probably alone in this regard
[08:27:39] I thought osmium had a sticky ignore (or whatever the exact icinga equivalent is) for puppet alerts?
[08:27:42] Maybe it expired
[08:29:38] it probably did, but that wouldn't be great either
[08:30:04] it's possible that someone will forget reenabling at some point and then we'll find the box not having run puppet for a month or something
[08:30:07] (it has happened before)
[08:30:26] anyway, that's a larger conversation and it's probably not the right day/time, at least for you :)
[08:32:08] !log ran mklost+found on /srv/postgres for reducing cronspam
[08:32:08] No, I don't mind. I'm thinking about it. I was about to say that my need for osmium is exceptional and temporary; I don't expect to be doing this on a regular basis for much longer. But then that's not entirely true, because there is a persistent need for some environment like it.
[08:32:12] <_joe_> ori: no reason to disable puppet on osmium anymore
[08:32:14] Logged the message, Master
[08:32:29] <_joe_> I removed all the puppet roles besides standard from it
[08:32:46] <_joe_> just so that you can experiment and _not_ disable puppet
[08:32:50] oh! I wish you had told me. I didn't know.
[08:32:56] <_joe_> again, a communications fail on my part
[08:32:56] <_joe_> :)
[08:33:26] <_joe_> yeah the fact we basically work in shifts sometimes gets something lost, sorry
[08:34:23] paravoid: aye
[08:34:50] <_joe_> YuviPanda: not sure creating lost+found is the right solution
[08:35:09] <_joe_> maybe disabling the cron may be, it strongly depends on the inherent case
[08:35:15] oh wow, was that discussion moot :)
[08:35:19] thanks _joe_
[08:35:28] <_joe_> paravoid: which one?
[08:35:35] mine with ori
[08:35:40] <_joe_> eheh ok
[08:36:34] <_joe_> paravoid: I figured that since we're not experimenting with puppet there anymore
[08:36:42] yeah makes sense
[08:36:46] <_joe_> we should not keep configuring things with it
[08:36:48] _joe_: yeah, I thought it was just one or two but now am digging in a bit deeper.
[08:36:54] is there someone here who could update the interwiki cache?
[08:36:57] <_joe_> YuviPanda: thanks
[08:37:11] _joe_: on that note, found one for hhvm :) forwarding to you and ori
[08:37:32] sent
[08:37:49] I'm wondering if the overall long-term solution would be for osmium to be part of labs
[08:37:51] <_joe_> just to me, don't bother ori with cronspam
[08:38:03] ah, sorry. already done, tho
[08:38:20] <_joe_> paravoid: the long-term solution would probably be to have powerful-enough VMs in labs
[08:38:47] sure that too, but having labs-on-metal for certain needs has been discussed before
[08:38:51] and I think it's a good idea
[08:39:00] e.g. some people wanted to do performance testing
[08:39:06] yeah, and the stat machines
[08:39:08] <_joe_> I think labs-on-metal is better than misc-on-metal :)
[08:39:16] RECOVERY - NTP on elastic1022 is OK: NTP OK: Offset -0.001975893974 secs
[08:44:11] (PS2) Ori.livneh: Add unit tests for pybal.ipvs.LVSService [debs/pybal] - https://gerrit.wikimedia.org/r/172206
[08:44:31] (PS3) Ori.livneh: Add unit tests for pybal.ipvs.LVSService [debs/pybal] - https://gerrit.wikimedia.org/r/172206
[08:45:55] PROBLEM - puppet last run on osmium is CRITICAL: CRITICAL: Puppet last ran 2 days ago
[08:46:57] RECOVERY - puppet last run on osmium is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[08:51:38] (CR) Glaisher: "Please remove 102 from wgNamespacesToBeSearchedDefault & wgContentNamespaces arrays as well." [mediawiki-config] - https://gerrit.wikimedia.org/r/172012 (https://bugzilla.wikimedia.org/73164) (owner: Dereckson)
[08:52:31] <_joe_> !log dist-upgrading mw1189 to use the latest kernel available, then rebooting
[08:52:35] Logged the message, Master
[08:56:16] PROBLEM - Host mw1189 is DOWN: PING CRITICAL - Packet loss = 100%
[08:57:16] RECOVERY - Host mw1189 is UP: PING WARNING - Packet loss = 73%, RTA = 0.68 ms
[09:00:59] (PS4) Ori.livneh: Add unit tests for pybal.ipvs.LVSService [debs/pybal] - https://gerrit.wikimedia.org/r/172206
[09:01:47] (Restored) Hashar: Jenkins #1 (please ignore) [debs/pybal] - https://gerrit.wikimedia.org/r/84932 (owner: Hashar)
[09:01:50] (PS15) Hashar: Jenkins #1 (please ignore) [debs/pybal] - https://gerrit.wikimedia.org/r/84932
[09:02:39] ori: you are too fast :-]
[09:02:50] (Abandoned) Hashar: Jenkins #1 (please ignore) [debs/pybal] - https://gerrit.wikimedia.org/r/84932 (owner: Hashar)
[09:02:56] <_joe_> !log repooling mw1189 at reduced load
[09:02:58] Logged the message, Master
[09:05:12] hi hashar
[09:06:05] ori: it is nice to see some tests being proposed to pybal. I have enqueued jobs for all your patchsets
[09:06:35] hashar: awesome, thanks. do you know if pybal is set up anywhere in labs?
[09:06:49] ori: it is not setup on beta :/
[09:07:57] oh well, maybe i'll give it a shot
[09:10:11] <_joe_> ori: what would you like to test?
[09:10:28] <_joe_> the LVS part of pybal or the HA part?
[09:10:51] <_joe_> I guess the LVS one; in that case it's easy-peasy to install in labs
[09:10:53] ori: for now the app servers are defined as backends of the varnish backend
[09:11:01] which is merely because there is no LVS support on labs
[09:11:07] <_joe_> oh by "labs" you mean "beta"
[09:11:38] _joe_: both
[09:12:24] <_joe_> ori: pybal's HA model is based on using bgp, I'm not sure how that would work with openstack networking (I _sincerely_ have no idea)
[09:13:27] (PS3) Filippo Giunchedi: jenkins: use openjdk-7-jre-headless [puppet] - https://gerrit.wikimedia.org/r/153764 (owner: Hashar)
[09:13:34] (CR) Filippo Giunchedi: [C: +2 V: +2] jenkins: use openjdk-7-jre-headless [puppet] - https://gerrit.wikimedia.org/r/153764 (owner: Hashar)
[09:14:05] !log Jenkins: switching from Java 6 to Java 7 {{gerrit|153764}}
[09:14:07] Logged the message, Master
[09:14:41] it should be possible to use test doubles for everything
[09:15:05] godog: good morning
[09:15:11] godog: any idea why ms-be2011 had a foreign disk?
[09:16:19] paravoid: hey :) no I didn't notice that, and afaik nothing failed on that recently?
[09:17:10] <_joe_> !log upgrading hhvm across the fleet with new package with debug symbols
[09:17:12] Logged the message, Master
[09:17:53] _joe_: ori: the poor instance deployment-mediawiki01 /var is filled up due to apache access/error logs :/
[09:18:19] hashar: what do you suggest we do?
[09:18:22] the app servers are supposed to use some rsyslog config tweak
[09:18:30] and send everything to a central host
[09:18:34] while not logging anything locally
[09:18:37] they do, but they also save locally, which is useful
[09:18:48] agreed :D
[09:19:09] we could add a more frequent logrotation script
[09:19:11] might want to rotate them more often so, or log to some place with more disk space than /var/ (it is only 2GB since we moved to eqiad)
[09:19:14] right
[09:19:32] !log Restarting Jenkins to java 7
[09:19:36] Logged the message, Master
[09:20:01] i'll submit a patch to add an additional logrotate script for the labs app servers
[09:20:35] I have also no clue whether the centralized syslog ends up in logstash
[09:20:41] which would be even better :]
[09:22:17] RECOVERY - Host ms-be2011 is UP: PING OK - Packet loss = 0%, RTA = 42.90 ms
[09:22:17] RECOVERY - very high load average likely xfs on ms-be2011 is OK: OK - load average: 5.98, 1.39, 0.46
[09:22:57] RECOVERY - RAID on ms-be2011 is OK: OK: optimal, 14 logical, 14 physical
[09:23:07] it's ok for there to likely be very high load average? that's a confusing alert message
[09:26:20] ori: initially it was "very high load average, likely xfs" before discovering that icinga doesn't like commas (!) in error messages
[09:26:40] but yes it could be more descriptive
[09:27:35] godog: your understated humor is the best :P
[09:28:35] hahah
[09:38:07] (PS2) Filippo Giunchedi: add graphite-related CNAMEs [dns] - https://gerrit.wikimedia.org/r/171525
[09:38:15] (CR) Filippo Giunchedi: [C: +2 V: +2] add graphite-related CNAMEs [dns] - https://gerrit.wikimedia.org/r/171525 (owner: Filippo Giunchedi)
[09:41:27] (PS2) Filippo Giunchedi: txstatsd: add graphite-carbon dependency [puppet] - https://gerrit.wikimedia.org/r/171557
[09:41:35] (CR) Filippo Giunchedi: [C: +2 V: +2] txstatsd: add graphite-carbon dependency [puppet] - https://gerrit.wikimedia.org/r/171557 (owner: Filippo Giunchedi)
[09:50:39] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: puppet fail
[09:51:58] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: puppet fail
[10:06:07] <_joe_> !log upgraded hhvm on the whole cluster
[10:06:11] Logged the message, Master
[10:09:15] (PS2) Dereckson: Remove Anexo namespace on pt.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/172012 (https://bugzilla.wikimedia.org/73164)
[10:12:24] (CR) Dereckson: [C: -1] "Thank you Glaisher, fixed." [mediawiki-config] - https://gerrit.wikimedia.org/r/172012 (https://bugzilla.wikimedia.org/73164) (owner: Dereckson)
[10:12:31] (CR) Mark Bergsma: [C: +1] varnish/text: really retry on zend requests failing on HHVM [puppet] - https://gerrit.wikimedia.org/r/171839 (owner: Giuseppe Lavagetto)
[10:13:34] (PS2) Faidon Liambotis: Increase max file size of url downloader proxy to 1010mb [puppet] - https://gerrit.wikimedia.org/r/172120 (https://bugzilla.wikimedia.org/73200) (owner: Brian Wolff)
[10:13:42] (CR) Faidon Liambotis: [C: +2] Increase max file size of url downloader proxy to 1010mb [puppet] - https://gerrit.wikimedia.org/r/172120 (https://bugzilla.wikimedia.org/73200) (owner: Brian Wolff)
[10:17:21] (PS2) Giuseppe Lavagetto: varnish/text: really retry on zend requests failing on HHVM [puppet] - https://gerrit.wikimedia.org/r/171839
[10:17:47] (CR) Giuseppe Lavagetto: [C: +2] varnish/text: really retry on zend requests failing on HHVM [puppet] - https://gerrit.wikimedia.org/r/171839 (owner: Giuseppe Lavagetto)
[10:18:48] RECOVERY - puppet last run on ms-be2003 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[10:28:35] (PS1) Giuseppe Lavagetto: Divert 15% of anonymous traffic to HHVM [mediawiki-config] - https://gerrit.wikimedia.org/r/172213
[10:41:17] (CR) Filippo Giunchedi: [C: +1] Divert 15% of anonymous traffic to HHVM (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/172213 (owner: Giuseppe Lavagetto)
[10:43:59] (CR) Giuseppe Lavagetto: Divert 15% of anonymous traffic to HHVM (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/172213 (owner: Giuseppe Lavagetto)
[10:55:21] (PS1) Mark Bergsma: Add error logging for ipvsadm invocations [debs/pybal] - https://gerrit.wikimedia.org/r/172215
[10:58:10] (CR) jenkins-bot: [V: -1] Add error logging for ipvsadm invocations [debs/pybal] - https://gerrit.wikimedia.org/r/172215 (owner: Mark Bergsma)
[11:07:16] (CR) Hashar: "Thank you to have added the tox environments definition and the Jenkins jobs. The tests looks fine to me as well." (1 comment) [debs/pybal] - https://gerrit.wikimedia.org/r/172018 (owner: Ori.livneh)
[11:10:34] (CR) Hashar: "Seems fine, you might want to switch to tox as the entry point though." (1 comment) [debs/pybal] - https://gerrit.wikimedia.org/r/172019 (owner: Ori.livneh)
[11:12:16] (CR) Alexandros Kosiaris: [C: +2] Revert "Disable l10nupdate for the duration of CLDR 26 plural migration" [puppet] - https://gerrit.wikimedia.org/r/171516 (owner: Nikerabbit)
[11:14:15] (CR) Hashar: [C: +1] "nice coverage :)" [debs/pybal] - https://gerrit.wikimedia.org/r/172089 (owner: Ori.livneh)
[11:20:18] (CR) Mark Bergsma: [C: +1] "I don't know enough about (Python) unit testing to check all details, but this looks good to me." [debs/pybal] - https://gerrit.wikimedia.org/r/172018 (owner: Ori.livneh)
[11:22:20] (CR) Mark Bergsma: [C: +1] Add unit tests for `pybal.util.LogFile` [debs/pybal] - https://gerrit.wikimedia.org/r/172089 (owner: Ori.livneh)
[11:24:08] (CR) Mark Bergsma: [C: +1] Add unit tests for pybal.ipvs.IPVSManager [debs/pybal] - https://gerrit.wikimedia.org/r/172102 (owner: Ori.livneh)
[11:27:29] (PS1) Springle: Use full IPv4 address to generate MariaDB server_id. [puppet/mariadb] - https://gerrit.wikimedia.org/r/172216
[11:28:06] (CR) Mark Bergsma: [C: +1] Add unit tests for pybal.ipvs.LVSService (1 comment) [debs/pybal] - https://gerrit.wikimedia.org/r/172206 (owner: Ori.livneh)
[11:28:54] (CR) Hashar: [C: +1] "All fine to me. IPVSManager.modifyState() is not covered but we can look at it later on (using the mock module to intercept os.popen())." [debs/pybal] - https://gerrit.wikimedia.org/r/172102 (owner: Ori.livneh)
[11:29:19] (CR) Springle: "Anyone (_joe_?) seen a nicer way to do this?" [puppet/mariadb] - https://gerrit.wikimedia.org/r/172216 (owner: Springle)
[11:29:29] springle: uniqueid?
[11:29:43] <_joe_> if no one has objections, I will push the 15% to hhvm change live
[11:30:08] <_joe_> springle: what paravoid said
[11:30:58] <_joe_> springle: https://docs.puppetlabs.com/facter/1.6/core_facts.html#uniqueid
[11:31:04] is that a deterministic 32 bit unsigned number?
[11:31:08] * springle reads
[11:31:10] (which is coreutils' hostid)
[11:31:33] http://www.gnu.org/software/coreutils/manual/html_node/hostid-invocation.html
[11:31:42] springle: in a hex format but yes
[11:32:22] (CR) Giuseppe Lavagetto: [C: +2] Divert 15% of anonymous traffic to HHVM [mediawiki-config] - https://gerrit.wikimedia.org/r/172213 (owner: Giuseppe Lavagetto)
[11:32:36] could anything like hardware changes affect it?
[11:33:36] I think it's IP-based
[11:33:56] <_joe_> paravoid: not only IP-based, I think
[11:34:02] <_joe_> but lemme check it
[11:34:30] open("/etc/hostid", O_RDONLY) = -1 ENOENT (No such file or directory)
[11:34:34] open("/etc/hosts", O_RDONLY|O_CLOEXEC) = 3
[11:35:02] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=595790
[11:35:10] paravoid@serenity:~$ hostid
[11:35:10] 007f0101
[11:35:13] that's 127.0.1.1, so...
[11:36:07] In the glibc implementation, the hostid is stored in the file /etc/hostid. (In glibc versions before 2.2, the file /var/adm/hostid was used.)
[11:36:07] In the glibc implementation, if gethostid() cannot open the file containing the host ID, then it obtains the hostname using gethostname(2), passes that hostname to gethostbyname_r(3) in order to obtain the host's IPv4 address, and returns a value obtained by bit-twiddling the IPv4 address. (This value may not be unique.)
[11:36:26] seems so
[11:45:05] hostid seems like an unreliable natural key, compared to puppet's @ipaddress as a theoretically reliable surrogate key :)
[11:49:32] (CR) Hashar: [C: +1] Add unit tests for pybal.ipvs.LVSService [debs/pybal] - https://gerrit.wikimedia.org/r/172206 (owner: Ori.livneh)
[11:50:51] (CR) Hashar: "Argh sorry, Jenkins jobs requires tox which is setup by the unmerged change https://gerrit.wikimedia.org/r/#/c/172018/" [debs/pybal] - https://gerrit.wikimedia.org/r/172215 (owner: Mark Bergsma)
[11:54:54] !log oblivian Synchronized wmf-config/CommonSettings.php: Open HHVM to 15% of anons (duration: 00m 06s)
[11:54:57] Logged the message, Master
[11:55:22] <_joe_> losing your connection during a sync-file run is "really nice"
[11:57:35] screen ftw _joe_ :)
[11:57:56] <_joe_> matanya: yes
[12:07:22] (CR) Revi: [C: +1] Adding "*.nasa.gov" to wgCopyUploadsDomains. [mediawiki-config] - https://gerrit.wikimedia.org/r/172204 (owner: Steinsplitter)
[12:12:31] (PS1) Matanya: apache: very minor lint [puppet] - https://gerrit.wikimedia.org/r/172220
[12:25:40] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:36:05] (PS2) Giuseppe Lavagetto: monitoring: convert monitor_group to monitoring::group [puppet] - https://gerrit.wikimedia.org/r/170727
[12:43:10] RECOVERY - puppet last run on cp3020 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures
[12:53:39] (Draft1) Filippo Giunchedi: carbon-c-relay: add debian packaging [debs/carbon-c-relay] - https://gerrit.wikimedia.org/r/172228
[12:54:32] paravoid akosiaris ^ straightforward debian packaging for carbon-c-relay, should be easy enough :)
[12:54:43] looking
[12:57:04] possibly controversial, no /etc/init.d script *runs*
[12:57:33] (CR) Faidon Liambotis: [C: -1] carbon-c-relay: add debian packaging (4 comments) [debs/carbon-c-relay] - https://gerrit.wikimedia.org/r/172228 (owner: Filippo Giunchedi)
[13:01:05]
paravoid: thanks that was quick! I have to run to lunch now, will amend when I'm back
[13:01:27] (CR) KartikMistry: carbon-c-relay: add debian packaging (2 comments) [debs/carbon-c-relay] - https://gerrit.wikimedia.org/r/172228 (owner: Filippo Giunchedi)
[13:01:54] godog: more useless comments from me ;)
[13:02:19] hehe
[13:02:34] kart_: haha not useless at all! thanks :))
[13:02:58] but, git-buildpackage will be good with correct distname if one has multiple dists set in pbuilder.
[13:03:52] I think he should target unstable instead of trusty and upload it into Debian :)
[13:04:02] +1
[13:20:32] (PS1) Aude: Enable experimental Wikidata features on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/172239
[13:22:22] (CR) Tobias Gritschacher: [C: +1] Enable experimental Wikidata features on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/172239 (owner: Aude)
[13:33:01] (PS1) Hashar: tox env to build test coverage [debs/pybal] - https://gerrit.wikimedia.org/r/172243
[13:33:44] (CR) Hashar: "Bryan Davis has setup the coverage for mediawiki/tools/scap :)" [debs/pybal] - https://gerrit.wikimedia.org/r/172243 (owner: Hashar)
[14:13:23] paravoid: indeed, I'll need to mail pkg-graphite too
[14:37:45] (PS2) Filippo Giunchedi: carbon-c-relay: add debian packaging [debs/carbon-c-relay] - https://gerrit.wikimedia.org/r/172228
[14:37:57] (CR) Filippo Giunchedi: "thanks Faidon and Kartik for the good feedback!" (6 comments) [debs/carbon-c-relay] - https://gerrit.wikimedia.org/r/172228 (owner: Filippo Giunchedi)
[14:38:28] wikimedian DDs unite
[14:38:52] lol
[14:39:15] what's your stance regarding gbp and upstream using git btw? I'm always divided on whether I should just import-orig or just branch off upstream's git
[14:40:54] heh good question, TBH for packaging purposes I think it makes sense to have master be the packaging branch, which obviously creates confusion when using upstream's git
[14:41:26] I tend to import-orig FWIW
[14:41:51] for gdnsd it has worked pretty well
[14:42:02] the upstream's git I mean
[14:42:19] now that bblack is thinking of using submodules though, it may just blow up on my face
[14:43:10] hehe, did you use debian/ branches I suppose?
[14:43:14] yeah
[14:43:30] upstream-tree=tag
[14:43:30] debian-branch=debian
[14:43:30] upstream-tag = v%(version)s
[14:43:36] under debian/gbp.conf
[14:43:39] it works fairly well
[14:45:00] yep looks nice, have you tried a patch-queue on top of that too?
[14:45:20] no, brandon is the perfect upstream :P
[14:46:17] hahah that's very convenient alright
[14:46:40] I was thinking how it would work having upstream git in operations/debs repos and gerrit
[14:48:49] (PS2) Filippo Giunchedi: swift: report statsd data to localhost [puppet] - https://gerrit.wikimedia.org/r/171547
[14:49:07] (CR) Filippo Giunchedi: [C: +1] swift: report statsd data to localhost [puppet] - https://gerrit.wikimedia.org/r/171547 (owner: Filippo Giunchedi)
[14:51:08] godog: review collab-maint/geoipupdate if/when you have a moment :)
[14:51:22] paravoid: sure!
[14:51:42] (feel free to commit as well)
[14:53:18] also collab-maint/libmaxminddb
[14:55:54] paravoid: I'm assuming git? can't find either at http://anonscm.debian.org/cgit/
[14:56:09] yes, cgit runs on a cronjob every half hour or hour
[14:56:13] just git checkout it, it'll be there :)
[14:59:53] aha! okay
[15:26:40] (CR) Alexandros Kosiaris: [C: +1] carbon-c-relay: add debian packaging [debs/carbon-c-relay] - https://gerrit.wikimedia.org/r/172228 (owner: Filippo Giunchedi)
[15:27:34] (PS1) Gilles: Enable JPG thumbnail chaining on all wikis except commons [mediawiki-config] - https://gerrit.wikimedia.org/r/172254 (https://bugzilla.wikimedia.org/67525)
[15:34:16] godog: \o/ on the ITP
[15:34:21] I submitted a couple myself today
[15:34:31] (PS1) Rush: phab change needs info status to stalled refs T212 [puppet] - https://gerrit.wikimedia.org/r/172255
[15:34:37] plus an RFP->ITP
[15:34:49] during the weekend
[15:35:03] (CR) Alexandros Kosiaris: [C: -1] apache: very minor lint (1 comment) [puppet] - https://gerrit.wikimedia.org/r/172220 (owner: Matanya)
[15:35:30] (PS2) Rush: phab change needs info status to stalled refs T212 [puppet] - https://gerrit.wikimedia.org/r/172255
[15:36:39] (CR) Rush: [C: +2] phab change needs info status to stalled refs T212 [puppet] - https://gerrit.wikimedia.org/r/172255 (owner: Rush)
[15:36:41] (PS2) Matanya: apache: very minor lint [puppet] - https://gerrit.wikimedia.org/r/172220
[15:37:11] (CR) Alexandros Kosiaris: [C: +2] apache: very minor lint [puppet] - https://gerrit.wikimedia.org/r/172220 (owner: Matanya)
[15:39:39] godog: collab-maint/python-maxminddb too please :)
[15:41:58] manybubbles, marktraceur, ^d: Who wants to SWAT this morning?
[15:42:18] I'm less inclined this morning if that is ok
[15:42:19] I think I'm going to pass this one
[15:42:22] I can still do
[15:42:24] if needed
[15:43:33] (CR) KartikMistry: [C: +1] "Looks good. I haven't built package though :)" [debs/carbon-c-relay] - https://gerrit.wikimedia.org/r/172228 (owner: Filippo Giunchedi)
[15:43:35] <^d> I've got a deploy in an hour and a couple other things on my plate, pass.
[15:43:57] Fine, I'll do it. q:
[15:44:08] Thanks anomie.
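For reference, the three gbp settings paravoid quoted in the gdnsd discussion above combine into a debian/gbp.conf along these lines (a sketch assembled from the quoted lines; the [DEFAULT] section placement follows git-buildpackage's documented config format and is not shown in the log):

```ini
# debian/gbp.conf -- packaging lives on a "debian" branch; upstream
# tarballs are built from upstream's release tags (v<version>) rather
# than from an import-orig'd upstream branch.
[DEFAULT]
debian-branch = debian
upstream-tree = tag
upstream-tag = v%(version)s
```

With this layout `gbp buildpackage` can be run directly on a clone of upstream's repository plus the debian branch, which is the trade-off against import-orig that the two were weighing.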
[15:45:14] paravoid: haha sure, unlikely it'll be today tho
[15:46:28] (PS3) Giuseppe Lavagetto: monitoring: convert monitor_group to monitoring::group [puppet] - https://gerrit.wikimedia.org/r/170727
[15:47:41] (CR) Giuseppe Lavagetto: [C: +2] monitoring: convert monitor_group to monitoring::group [puppet] - https://gerrit.wikimedia.org/r/170727 (owner: Giuseppe Lavagetto)
[15:48:46] thanks
[15:50:04] !log reedy Synchronized wmf-config/interwiki.cdb: Updating interwiki cache (duration: 00m 14s)
[15:50:11] Logged the message, Master
[15:50:26] aude, gi11es: Ping for SWAT in 10 minutes
[15:50:34] anomie: pong
[15:50:49] (PS1) Reedy: Rebuild iw cache for maiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/172259
[15:51:13] (CR) Reedy: [C: +2] Rebuild iw cache for maiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/172259 (owner: Reedy)
[15:51:21] (Merged) jenkins-bot: Rebuild iw cache for maiwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/172259 (owner: Reedy)
[15:52:29] PROBLEM - puppet last run on cp1038 is CRITICAL: CRITICAL: puppet fail
[15:52:43] (PS2) Reedy: Remove old AdminSettings.php (symlink) [mediawiki-config] - https://gerrit.wikimedia.org/r/145408
[15:53:38] * aude here
[15:54:49] (PS3) Reedy: Remove old AdminSettings.php (symlink) [mediawiki-config] - https://gerrit.wikimedia.org/r/145408 (https://bugzilla.wikimedia.org/67820)
[16:00:05] manybubbles, anomie, ^d, marktraceur, aude: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141110T1600).
[16:00:11] * anomie begins SWAT
[16:00:14] aude: You're first
[16:00:17] ok
[16:00:25] (PS2) Anomie: Enable experimental Wikidata features on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/172239 (owner: Aude)
[16:00:33] (CR) Anomie: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/172239 (owner: Aude)
[16:00:40] (Merged) jenkins-bot: Enable experimental Wikidata features on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/172239 (owner: Aude)
[16:01:01] !log anomie Synchronized wmf-config/Wikibase.php: SWAT: Enable experimental Wikidata features on labs [[gerrit:172239]] (duration: 00m 09s)
[16:01:02] aude: ^ Test please
[16:01:06] Logged the message, Master
[16:01:14] gi11es: You're next
[16:01:20] PROBLEM - puppet last run on neon is CRITICAL: CRITICAL: puppet fail
[16:02:00] anomie: looks ok
[16:02:08] <_joe_> mmmh
[16:02:09] (PS2) Anomie: Enable JPG thumbnail chaining on all wikis except commons [mediawiki-config] - https://gerrit.wikimedia.org/r/172254 (https://bugzilla.wikimedia.org/67525) (owner: Gilles)
[16:02:16] <_joe_> that sounds like my fail ^^
[16:02:20] <_joe_> (neon)
[16:02:31] (CR) Anomie: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/172254 (https://bugzilla.wikimedia.org/67525) (owner: Gilles)
[16:02:38] (Merged) jenkins-bot: Enable JPG thumbnail chaining on all wikis except commons [mediawiki-config] - https://gerrit.wikimedia.org/r/172254 (https://bugzilla.wikimedia.org/67525) (owner: Gilles)
[16:03:00] !log anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable JPG thumbnail chaining on all wikis except commons [[gerrit:172254]] (duration: 00m 09s)
[16:03:01] gi11es: ^ Test please
[16:03:04] Logged the message, Master
[16:03:10] anomie: testing...
[16:10:10] (03PS1) 10Filippo Giunchedi: txstats/graphite: fix package conflict, use require_package [puppet] - 10https://gerrit.wikimedia.org/r/172261 [16:10:49] RECOVERY - puppet last run on cp1038 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [16:10:55] anomie: not seeing the effect yet, I'll try again in 5 minutes [16:11:10] (03PS2) 10Filippo Giunchedi: txstats/graphite: fix package conflict, use require_package [puppet] - 10https://gerrit.wikimedia.org/r/172261 [16:14:42] * anomie double-checks that yes, the change actually did get deployed. Appears to have been. [16:14:54] _joe_: https://gerrit.wikimedia.org/r/#/c/172261/ looks good? [16:16:01] (03CR) 10Manybubbles: [C: 031] "Works in beta. Needs I7c1a8281388b1ff4b6aea4f5fb176b14317d9bce from Cirrus to display error messages properly. Since we've turned off re" [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/172032 (owner: 10Manybubbles) [16:16:44] <_joe_> godog: 1 sec sorry I was after my own error :P [16:16:57] popping the error stack [16:17:40] <_joe_> godog: one risk you have is require_package working with requires can sometimes be tricky [16:18:29] (03CR) 10Giuseppe Lavagetto: [C: 031] "Look out for failures between require_package and require => Package around your code though." [puppet] - 10https://gerrit.wikimedia.org/r/172261 (owner: 10Filippo Giunchedi) [16:18:52] oh, mhhh [16:19:36] _joe_: out of curiosity why is that? [16:19:38] (03PS1) 10Giuseppe Lavagetto: nagios: fix virtual resource collection [puppet] - 10https://gerrit.wikimedia.org/r/172264 [16:19:39] <_joe_> but, let's try and fix it if it fails [16:20:13] <_joe_> godog: I think puppet is too dumb for the tricks we pull there and sometimes the compiler can't see that the resource has been declared [16:20:15] anomie: still not seeing the effect on enwiki... 
[16:20:27] <_joe_> I've seen that happen once but I don't remember how I fixed that [16:20:44] (03CR) 10Giuseppe Lavagetto: [C: 032] nagios: fix virtual resource collection [puppet] - 10https://gerrit.wikimedia.org/r/172264 (owner: 10Giuseppe Lavagetto) [16:20:56] _joe_: okay I'll try it and run puppet on tungsten [16:21:09] (03PS3) 10Filippo Giunchedi: txstats/graphite: fix package conflict, use require_package [puppet] - 10https://gerrit.wikimedia.org/r/172261 [16:21:16] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] txstats/graphite: fix package conflict, use require_package [puppet] - 10https://gerrit.wikimedia.org/r/172261 (owner: 10Filippo Giunchedi) [16:21:21] Morning ops [16:21:23] * AndyRussG waves [16:23:01] Quick question: if I write a method in PHP to make an HTTP request to the web API of another wiki also on the cluster (as in, any WMF wiki -> calls an API on Meta) is the response of the API call cached, and if so, for how long? [16:24:00] <_joe_> AndyRussG: why would you care? [16:24:11] Aaaaah do u really wanna know? [16:24:23] It's for a new way of serving banners for CentralNotice [16:24:33] <_joe_> lemme rephrase that: our cache is evicted when an article is modified [16:24:49] RECOVERY - puppet last run on neon is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [16:24:52] Instead of having the server decide everything we're sending a list of choices to the client via a RL module [16:24:56] tentatively [16:25:11] <_joe_> so why would cache duration matter? [16:25:15] _joe_: what about an API response? [16:25:37] <_joe_> AndyRussG: an api response about what? [16:25:51] It's a new API that we're adding on CentralNotice [16:26:02] Here's the change https://gerrit.wikimedia.org/r/#/c/170843/ [16:26:39] I've tried to optimize...
Pls feel free to tear it to shreds on any issue whatsoever [16:26:43] <_joe_> ok, I would need to take a better look at how eviction works, to be able to answer about a new api :) [16:27:13] (03CR) 10Reedy: [C: 031] "This should be good to go now..." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/145408 (https://bugzilla.wikimedia.org/67820) (owner: 10Reedy) [16:27:22] _joe_: Ah OK, if you have the chance to do that, that'd be really fantastic :D [16:27:28] <_joe_> but I think other people might already know the answer [16:27:45] * AndyRussG gets ready to ping... ? [16:27:45] I was about to type a quick answer, but even I have to check first [16:28:02] * bd808 introduces AndyRussG to anomie for questions about spooky api actions [16:28:20] * anomie doesn't know much about the caching side of it [16:29:40] Hmm... RoanKattouw gave me some tips before on somewhat related stuff... [16:30:10] So in general, I think for API requests the cache headers on the app-layer output control cacheability (keep in mind Vary, params, and that if logged in then the login cookie uncaches it all) [16:30:25] but then also if the app layer says 0s for it will become 120s [16:31:06] bblack: 120s is certainly more than fast enuf [16:31:09] so yeah [16:31:27] It'll be called anonymously [16:31:50] (03PS1) 10Filippo Giunchedi: work around puppet function arity issues [puppet] - 10https://gerrit.wikimedia.org/r/172266 [16:31:50] Should I set the header from PHP? Or what would the default header say for an API? Maybe I should just use that? 
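[Editor's note: a rough model of the anonymous-request caching behaviour bblack describes above (a login cookie uncaches everything; an app-layer TTL of 0 effectively becomes 120 seconds at the cache layer). The function name, the 120-second constant, and the return convention are illustrations of the chat, not actual Varnish logic.]

```python
def effective_cache_ttl(app_ttl_seconds, logged_in=False):
    """Sketch of the caching behaviour described in the chat.

    - Logged-in requests (login cookie present) bypass the cache.
    - An app-layer TTL of 0 ends up behaving like ~120s at the Varnish
      layer ("if the app layer says 0s ... it will become 120s").
    - A longer explicit TTL (e.g. via the API's smaxage parameter) is
      honoured as-is.
    """
    VARNISH_FLOOR = 120  # seconds, per bblack's estimate in the chat
    if logged_in:
        return 0  # not cached at all
    if app_ttl_seconds <= 0:
        return VARNISH_FLOOR
    return app_ttl_seconds
```

So for anonymous CentralNotice API calls with identical URLs, the takeaway above is roughly a two-minute cache lifetime unless the API explicitly sets something longer.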
[16:31:55] ok, so for an anon API request, for the exact same params in the URL and other variance, you'd expect 120s I think [16:32:16] (unless the API layer sets it longer) [16:32:42] I have no idea tbh, how/where the API code's output cache headers are set, or why/how [16:32:53] bblack: sounds great :) I guess that's something I could have just checked by calling a few APIs, heh [16:32:57] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] work around puppet function arity issues [puppet] - 10https://gerrit.wikimedia.org/r/172266 (owner: 10Filippo Giunchedi) [16:32:59] * AndyRussG removes glasses to facepalm [16:34:13] bblack _joe_ anomie another option that was bandied (without any investigation) was to try to call Meta's DB directly [16:34:24] Sounds messier tho [16:34:41] ^bandied about* [16:35:20] AndyRussG: Just don't do what TimedMediaHandler does. As for API caching headers, one thing to keep in mind is that the API needs values for the 'maxage' and/or 'smaxage' parameters to output public caching. [16:36:38] (03CR) 10Chad: [C: 032 V: 032] Update elasticsearch plugins to fix regex issue [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/172032 (owner: 10Manybubbles) [16:37:27] anomie: right, I've seen those headers set directly from PHP, though I kinda feel like if the defaults are good, I should go with them [16:38:00] It doesn't really matter if it's 120s or 600s, just so long as it's not 2 weeks, and we can reliably predict the duration [16:38:33] I guess caching won't work any differently if the request is coming from the cluster [16:39:48] (what does TimedMediaHandler do, out of curiosity?) [16:42:06] Hmm, not logged in, I get Cache-Control:"no-cache" for http://en.wikipedia.org/w/api.php?action=query&list=allcategories&acprop=size [16:42:20] (also fine) [16:43:18] anomie: I see the right values with eval.php on terbium for enwiki, so the config change looks fine, not sure why the code doesn't kick in. 
do image scalers get their config in a way that InitialiseSettings.php values wouldn't apply to them? [16:43:47] gi11es: I have no idea. Reedy? ^ [16:44:38] Nope... [16:44:51] They should be scapped out with mediawiki-installation [16:45:04] so a thumbnailing job for a given wiki should get the config for that wiki? [16:45:24] I believe so [16:45:45] Oh, aren't the job loops long(ish) running? [16:45:50] (03PS1) 10Filippo Giunchedi: fix require_package documentation and txstatsd/graphite usage [puppet] - 10https://gerrit.wikimedia.org/r/172270 [16:45:56] So there'd be some time between picking up new config changes [16:46:06] any idea how long those are? [16:46:17] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] fix require_package documentation and txstatsd/graphite usage [puppet] - 10https://gerrit.wikimedia.org/r/172270 (owner: 10Filippo Giunchedi) [16:46:31] it's already been 40 minutes or so since the deploy [16:46:32] I'm confusing things [16:46:43] Does thumbnailing actually use the job queue? [16:47:13] they run on the image scaler servers, that's all I know [16:47:24] Transcodes definitely do [16:47:47] thumb.php returns the image when it's been scaled... [16:47:59] Sometimes, just touching InitialiseSettings.php and syncing fixes things like this [16:48:15] (03PS1) 10Chad: CirrusSearch on zhwiki as default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172272 [16:48:17] * Reedy does it [16:48:28] !log reedy Synchronized wmf-config/InitialiseSettings.php: tocuh (duration: 00m 14s) [16:48:33] Logged the message, Master [16:49:59] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [16:50:27] !log deployed new versions of elasticsearch plugins to fix regex querying [16:50:31] Logged the message, Master [16:50:34] gi11es: what's the visible change supposed to be? 
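[Editor's note: the visible change under test here is gi11es's JPG thumbnail chaining, deployed in the SWAT earlier; he describes it in this conversation as "request say a 100px thumbnail, the 128px, 256px, 512px ones should be rendered at that point". A minimal sketch of that idea, assuming a naive "every chain size at or above the request" rule — gi11es also notes a 120px request starts at 256, so the real cutoff is subtler than this.]

```python
# Chain sizes 128/256/512 are the ones named in the chat; the real
# extension's list may differ.
CHAIN_SIZES = [128, 256, 512]

def sizes_rendered_for(requested_px):
    """Return which chained thumbnail sizes a request would trigger.

    Naive rule: render every chain size at or above the requested
    width, so small thumbnails can later be derived from cached larger
    ones instead of the original.
    """
    return [s for s in CHAIN_SIZES if s >= requested_px]
```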
[16:50:58] AndyRussG: re: Cache-Control on your not-logged-in API request: we add that after taking care of varnish caching (so it applies to browsers and external caches, but doesn't necessarily describe what's happening in Varnish) [16:51:33] bblack: ah got it [16:51:42] Reedy: that when a small thumbnail size is requested, it's generated based on a chain of larger sizes. so if you upload a new file, and request say a 100px thumbnail, the 128px, 256px, 512px ones should be rendered at that point [16:51:49] and requesting those shouldn't be a total miss [16:51:55] In *theory* touching InitializeSettings.php shouldn't change anything anymore as sync-common does it locally on each host after it rsyncs -- https://github.com/wikimedia/mediawiki-tools-scap/blob/master/scap/tasks.py#L285-L291 [16:52:07] !log restarting elastic1001 to pick up new plugins. [16:52:10] Logged the message, Master [16:52:14] bd808: Right... [16:52:26] I've seen it help now and again [16:52:39] theory and reality often diverge :/ [16:53:05] I wonder if the touch should be moved/duplicated to the source side too [16:54:01] bblack _joe_ anomie thanks much :D [16:55:27] Reedy: if you know where I can read the prod thumbnail files, though, checking that things work becomes easier. just upload a new file, visit its page (which will hit the 120px thumbnail) and see if you have the 128 thumbnail generated in the folder [16:55:57] I know where to look for that on labs but I'm not sure where to find thumbnail files on prod [16:56:25] actually, 256 would exist for 120, but not 128 [16:56:35] gi11es: They're in swift, so they're not obviously on disk... [16:56:38] godog: about?
:) [16:57:14] ah yes, then it's a matter of querying swift, but that's pretty much the same as hitting the url and looking at the headers in practice [16:57:49] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [16:58:01] I just want to make sure that whatever machine render the thumbnail gets the right config values. if that's the case, then I know the problem is elsewhere [16:58:27] Reedy: sure [16:58:57] so hitting the url would be the same thing if it shows up as cached it was there already [16:59:21] right so, what I want to clear up is what servers render the thumbnails [16:59:57] you can find that easily via ganglia [17:00:01] https://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&m=cpu_report&s=by+name&c=Image%2520scalers%2520eqiad&tab=m&vn=&hide-hf=false [17:00:05] manybubbles, ^d: Dear anthropoid, the time has come. Please deploy Search (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141110T1700). [17:00:33] ok, so it is machines from the image scalers cluster. in that case, what wiki config do they run? [17:01:11] since the path is web server -> varnish -> swift -> image scaler I doubt that the image scaler is aware that the request was for enwiki originally, right? [17:01:16] (03CR) 10Chad: [C: 032] CirrusSearch on zhwiki as default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172272 (owner: 10Chad) [17:01:24] (03Merged) 10jenkins-bot: CirrusSearch on zhwiki as default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172272 (owner: 10Chad) [17:01:28] they should be the same as the rest of the app servers that run mw [17:01:37] bblack: out of curiosity just to see if I'm getting stuff--is this set in Puppet/templates/varnish/text-backend.inc.vcl.erb ? 
[17:02:04] !log demon Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 05s) [17:02:08] Logged the message, Master [17:03:10] gi11es: [17:03:11] if ( $scriptName === '/w/thumb.php' && $serverName === 'upload.wikimedia.org' ) { [17:03:11] $multiVersion = MWMultiVersion::initializeForUploadWiki( $_SERVER['PATH_INFO'] ); [17:03:26] So it should be setup for said wiki [17:03:47] https://github.com/wikimedia/operations-mediawiki-config/blob/master/multiversion/MWVersion.php#L27-L28 [17:04:01] gi11es: the swift middleware rewrite will pass along some headers, so the image scaler should have an idea of what is handling [17:05:52] <^d> manybubbles: I'm not seeing any problems with zhwiki. [17:06:18] ^d: yay! [17:06:50] AndyRussG: the parts about login cookies and the parts about the 120s hit-for-pass on 0s objects are there. [17:07:27] AndyRussG: "sub vcl_deliver" in text-frontend.inc.vcl.erb has the part about inserted a new cache-control line for the outside world [17:07:56] (oh, now I see that's only for wiki pages, not API) [17:08:09] bblack: interesting! Yeah I can tell it's not simple [17:09:07] So in this context "text" means stuff like API's that output text (would that include HTML?) and back/frond end refers to setting caching on Varnish vs. caching instructions for the outside world? [17:09:17] AndyRussG: there's also another set of templates in modules/varnish/templates/vcl . And then varnish has its own default behaviors for each sub. The combined multiple definitions of each sub like vcl_fetch or vcl_deliver are concatenated (including the default) [17:09:58] in this context, "text" means our text varnish caches, which is the pool that handles things that aren't in the pools for "bits", "upload", "mobile" [17:10:13] huh! interesting [17:10:13] (or a few others at a different layer, e.g. 
parsoidcache) [17:10:47] at that layer, text/bits/upload/mobile is mostly about the public IP address the request hits, which differs by request hostname [17:11:09] man I'd wanna just attach a debugger on those... crazy layered compiled procedural config... things... [17:11:30] with great power comes great confusion! [17:11:36] hehe lol [17:12:10] well said, will re-use if license permits [17:12:39] getting lots of ZeroBanner errors [17:12:46] ^d: you seeing these? ^ [17:13:02] <^d> Wasn't looking :) [17:13:13] was just checking up one last time on zhwiki before lunch [17:13:41] <^d> fatal, yippie. [17:13:54] slowing down I think [17:14:20] who do we ping other than yurik? [17:14:25] he doesn't seem to be online [17:14:44] I think there's some other mobile people involved [17:14:59] What sort of fatals are they? [17:15:05] ie example? [17:15:27] <^d> call on non-object. [17:15:34] <^d> [2014-11-10 17:13:28] Fatal error: Call to a member function showImages() on a non-object at /srv/mediawiki/php-1.25wmf6/extensions/ZeroBanner/includes/PageRendering.php on line 345 [17:15:38] <^d> Same thing on all of them [17:15:43] greg-g: CI meeting tomorrow is cancelled, right? [17:15:48] Reedy: seems to have calmed downa bit though [17:16:12] !log elastic1001 finished restarting. letting is soak up shards for a few minutes to make sure restart was ok. then we'll plow through the others [17:16:17] Logged the message, Master [17:16:22] SpecialMobileEditWatchlist::images [17:16:30] Were there any changes in the job queue infrastructure recently? The ganglia graph on terbium has stopped almost two weeks ago: https://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20eqiad&h=terbium.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1395860566&v=648583&m=Global_JobQueue_length&z=large [17:16:40] if ( $this->isZeroSubdomain() && !$this->config->showImages() ) { [17:17:08] (03CR) 10Ori.livneh: "> Should we have the Debian package to run the test on building?" 
[debs/pybal] - 10https://gerrit.wikimedia.org/r/172018 (owner: 10Ori.livneh) [17:17:52] <^d> gwicke: Nothing on mw/core side that Aaron did that I recall. I probably would've merged it. [17:18:02] manybubbles: ^d looks like the code should call getZeroConfig() to ensure $this->config is setup. I'll make a patch at least [17:18:18] Reedy: cool [17:18:29] <^d> It looks less brokey in master. [17:18:30] <^d> hmm [17:18:42] (03CR) 10Alexandros Kosiaris: [C: 032] swift: report statsd data to localhost [puppet] - 10https://gerrit.wikimedia.org/r/171547 (owner: 10Filippo Giunchedi) [17:18:46] <^d> maybe not, nvm [17:18:57] andrewbogott: yeppers, i should do that official in the calendar, thanks :) [17:19:20] https://gerrit.wikimedia.org/r/#/c/172278/ [17:19:46] ^d: the jobs are still being processed, and the lengths as reported by mwscript also look relatively normal; so maybe the hosts were swapped, or the ganglia monitoring has some issue [17:20:56] Wasn't there some ganglia changes a couple of weeks ago? 
[17:22:07] on October 24th uranium was promoted to prod ganglia it seems [17:22:34] after disk failure on nickel [17:23:16] the timing could fit [17:27:01] (03PS3) 10Ori.livneh: Add tests for pybal.util.ConfigDict [debs/pybal] - 10https://gerrit.wikimedia.org/r/172018 [17:27:12] (03PS4) 10Ori.livneh: Add tests for pybal.util.ConfigDict [debs/pybal] - 10https://gerrit.wikimedia.org/r/172018 [17:27:19] (03CR) 10Ori.livneh: [C: 032] Add tests for pybal.util.ConfigDict [debs/pybal] - 10https://gerrit.wikimedia.org/r/172018 (owner: 10Ori.livneh) [17:27:33] (03Merged) 10jenkins-bot: Add tests for pybal.util.ConfigDict [debs/pybal] - 10https://gerrit.wikimedia.org/r/172018 (owner: 10Ori.livneh) [17:28:28] (03PS3) 10Ori.livneh: Add .travis.yml file to enable automated tests on Travis CI [debs/pybal] - 10https://gerrit.wikimedia.org/r/172019 [17:29:12] (03CR) 10Ori.livneh: [C: 032] Add .travis.yml file to enable automated tests on Travis CI [debs/pybal] - 10https://gerrit.wikimedia.org/r/172019 (owner: 10Ori.livneh) [17:29:28] (03Merged) 10jenkins-bot: Add .travis.yml file to enable automated tests on Travis CI [debs/pybal] - 10https://gerrit.wikimedia.org/r/172019 (owner: 10Ori.livneh) [17:32:24] (03PS3) 10Ori.livneh: Add unit tests for `pybal.util.LogFile` [debs/pybal] - 10https://gerrit.wikimedia.org/r/172089 [17:32:30] (03CR) 10Ori.livneh: [C: 032] Add unit tests for `pybal.util.LogFile` [debs/pybal] - 10https://gerrit.wikimedia.org/r/172089 (owner: 10Ori.livneh) [17:32:44] (03Merged) 10jenkins-bot: Add unit tests for `pybal.util.LogFile` [debs/pybal] - 10https://gerrit.wikimedia.org/r/172089 (owner: 10Ori.livneh) [17:37:08] (03PS3) 10Ori.livneh: Add unit tests for pybal.ipvs.IPVSManager [debs/pybal] - 10https://gerrit.wikimedia.org/r/172102 [17:37:17] (03PS4) 10Ori.livneh: Add unit tests for pybal.ipvs.IPVSManager [debs/pybal] - 10https://gerrit.wikimedia.org/r/172102 [17:37:39] (03PS1) 10QChris: Link aggregator dataset into wikimetrics public webspace [puppet] - 
10https://gerrit.wikimedia.org/r/172285 (https://bugzilla.wikimedia.org/72740) [17:39:28] (03CR) 10Ori.livneh: [C: 032] Add unit tests for pybal.ipvs.IPVSManager [debs/pybal] - 10https://gerrit.wikimedia.org/r/172102 (owner: 10Ori.livneh) [17:39:39] (03Merged) 10jenkins-bot: Add unit tests for pybal.ipvs.IPVSManager [debs/pybal] - 10https://gerrit.wikimedia.org/r/172102 (owner: 10Ori.livneh) [17:42:59] (03PS5) 10Ori.livneh: Add unit tests for pybal.ipvs.LVSService [debs/pybal] - 10https://gerrit.wikimedia.org/r/172206 [17:43:43] (03CR) 10QChris: Add jobs for aggregating hourly projectcount files to daily per wiki csvs (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/172201 (https://bugzilla.wikimedia.org/72740) (owner: 10QChris) [17:45:18] (03PS1) 10Ori.livneh: Add Travis CI build status to README [debs/pybal] - 10https://gerrit.wikimedia.org/r/172286 [17:45:27] ^d, Reedy: https://rt.wikimedia.org/Ticket/Display.html?id=8837 [17:45:36] (03CR) 10Ori.livneh: [C: 032] Add unit tests for pybal.ipvs.LVSService [debs/pybal] - 10https://gerrit.wikimedia.org/r/172206 (owner: 10Ori.livneh) [17:45:50] (03Merged) 10jenkins-bot: Add unit tests for pybal.ipvs.LVSService [debs/pybal] - 10https://gerrit.wikimedia.org/r/172206 (owner: 10Ori.livneh) [17:46:34] (03PS11) 10ArielGlenn: data retention audit script for logs, /root and /home dirs [software] - 10https://gerrit.wikimedia.org/r/141473 [17:46:43] (03CR) 10jenkins-bot: [V: 04-1] data retention audit script for logs, /root and /home dirs [software] - 10https://gerrit.wikimedia.org/r/141473 (owner: 10ArielGlenn) [17:46:56] (03CR) 10Ori.livneh: [C: 032] Add Travis CI build status to README [debs/pybal] - 10https://gerrit.wikimedia.org/r/172286 (owner: 10Ori.livneh) [17:47:10] (03Merged) 10jenkins-bot: Add Travis CI build status to README [debs/pybal] - 10https://gerrit.wikimedia.org/r/172286 (owner: 10Ori.livneh) [17:50:10] (03PS3) 10Filippo Giunchedi: jheapdump: gdb-based heap dump for JVM [puppet] - 
10https://gerrit.wikimedia.org/r/170996 [17:51:10] (03CR) 10Filippo Giunchedi: "introduced java::tools class to include jheapdump, this should make things usable elsewhere too!" [puppet] - 10https://gerrit.wikimedia.org/r/170996 (owner: 10Filippo Giunchedi) [17:52:21] !log restart elastic1002 to pick up new plugins [17:52:25] Logged the message, Master [17:53:19] (03CR) 10Manybubbles: [C: 031] jheapdump: gdb-based heap dump for JVM [puppet] - 10https://gerrit.wikimedia.org/r/170996 (owner: 10Filippo Giunchedi) [17:59:10] (03PS1) 10Ori.livneh: Re-add explicit test loader in tests/__init__.py [debs/pybal] - 10https://gerrit.wikimedia.org/r/172290 [17:59:23] (03CR) 10Ori.livneh: [C: 032] Re-add explicit test loader in tests/__init__.py [debs/pybal] - 10https://gerrit.wikimedia.org/r/172290 (owner: 10Ori.livneh) [17:59:38] (03Merged) 10jenkins-bot: Re-add explicit test loader in tests/__init__.py [debs/pybal] - 10https://gerrit.wikimedia.org/r/172290 (owner: 10Ori.livneh) [18:09:08] (03PS1) 10Yuvipanda: admin: Add jdlrobson to researchers group [puppet] - 10https://gerrit.wikimedia.org/r/172292 [18:09:37] (03PS1) 10Ori.livneh: Add requirements.txt for Travis CI [debs/pybal] - 10https://gerrit.wikimedia.org/r/172293 [18:10:02] (03CR) 10Ori.livneh: [C: 032] Add requirements.txt for Travis CI [debs/pybal] - 10https://gerrit.wikimedia.org/r/172293 (owner: 10Ori.livneh) [18:10:16] (03Merged) 10jenkins-bot: Add requirements.txt for Travis CI [debs/pybal] - 10https://gerrit.wikimedia.org/r/172293 (owner: 10Ori.livneh) [18:10:52] (03CR) 10Ottomata: [C: 032] admin: Add jdlrobson to researchers group [puppet] - 10https://gerrit.wikimedia.org/r/172292 (owner: 10Yuvipanda) [18:14:30] (03CR) 10Ori.livneh: "one month ping" [puppet] - 10https://gerrit.wikimedia.org/r/165779 (owner: 10Ori.livneh) [18:15:14] ori: you don't have to keep pinging, it's on my radar [18:15:52] paravoid: ok :) i won't [18:17:54] hmm, puppet failure on stat1003 [18:18:28] from trebuchet [18:18:39] PROBLEM - 
puppet last run on stat1003 is CRITICAL: CRITICAL: puppet fail [18:18:43] apergos: ^ [18:18:49] ImportError: No module named salt.client [18:19:11] orilly [18:19:54] apergos: yarly [18:23:50] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [18:23:59] call me blind but I looked in the log and saw nothing like tht [18:24:09] and a run just now (10 mins later?) looks fine, and look, there's the recovery [18:24:17] Error: /usr/local/sbin/grain-ensure set trebuchet_master tin.eqiad.wmnet returned 1 instead of one of [0] [18:24:17] Error: /Stage[main]/Role::Trebuchet/Salt::Grain[trebuchet_master]/Exec[ensure_trebuchet_master_tin.eqiad.wmnet]/returns: change from notrun to 0 failed: /usr/local/sbin/grain-ensure set trebuchet_master tin.eqiad.wmnet returned 1 instead of one of [0] [18:24:36] then running 'sudo /usr/local/sbin/grain-ensure set trebuchet_master tin.eqiad.wmnet' [18:24:38] produced that error [18:25:15] running that on which host please? [18:25:23] stat1003? [18:25:26] yeah [18:25:38] I am still getting it again, just ran puppet, it failed again [18:26:10] trebuchet_master: tin.eqiad.wmnet [18:26:12] it's set [18:26:42] root@stat1003:~# /usr/local/sbin/grain-ensure set trebuchet_master tin.eqiad.wmnet [18:26:42] root@stat1003:~# [18:27:25] I guess it must be your user and the sudo somehow [18:27:28] hmmm. [18:27:59] PROBLEM - puppet last run on stat1003 is CRITICAL: CRITICAL: Puppet has 1 failures [18:28:52] you want to sudo -s and try that grain-ensure again? 
[18:29:03] yeah, I'm doing that now [18:29:08] well, sudo -s and a puppet run [18:29:15] !log ori Synchronized php-1.25wmf7/includes/ChangeTags.php: Iec9befeba: Hide HHVM tag on Special:{Contributions,RecentChanges,...} (duration: 00m 06s) [18:29:16] ah, I would have done the script first, quicker [18:29:17] anyways [18:29:20] Logged the message, Master [18:29:31] hmm, still sudo -s [18:29:33] err [18:29:37] even with sudo -s it's failing. [18:29:55] works fine with sudo su [18:29:55] wtf [18:29:56] !log ori Synchronized php-1.25wmf6/includes/ChangeTags.php: Iec9befeba: Hide HHVM tag on Special:{Contributions,RecentChanges,...} (duration: 00m 05s) [18:29:59] Logged the message, Master [18:30:17] heh [18:30:35] ok well give me one minute to restart the minion because it's writing to a log file that was rotated out of the way and compressed [18:30:46] !log reboot db1017 to pick up an updated kernel [18:30:47] that way you'll at least be able to see what it doesn't like about you (maybe) [18:30:49] Logged the message, Master [18:30:54] heh, yeah [18:31:42] restarted [18:32:09] RECOVERY - puppet last run on stat1003 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [18:33:57] hmm, nothing there. [18:34:17] oh yeah, [18:34:23] I think I know what's happening [18:35:10] fixed! [18:35:26] I was futzing around with my path earlier, for some py3 stuff, on stat1003. months ago... [18:37:04] what's with elasticsearch? [18:37:06] manybubbles? [18:38:15] <^d> Hmm? [18:38:21] <^d> up with? [18:38:56] WARNING - elasticsearch (production-search-eqiad) is running. status: yellow: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2117: active_shards: 6339: relocating_shards: 0: initializing_shards: 5: unassigned_shards: 23 [18:39:15] <^d> Ah, Nik bounced a node a bit ago to start a rolling restart. Maybe that's from that.
[18:39:38] <^d> Green now, must be ok [18:40:28] anomie: hey, I'm working with AndyRussG on the CentralNotice changes that need to query the metawiki db while processing a bits request. U said earlier, "don't do what TimedMediaHandler does", is there an explanation I can read, somewhere? Seems like the TMH code in handlers/TextHandler/TextHandler.php is pretty much what I'd like to do... [18:42:16] !log restarting remaining elasticsearch boxes in sequence to pick up new plugins [18:42:20] Logged the message, Master [18:42:39] paravoid: plugins. I can create a downtime if you'd like so it doesn't page anyone [18:42:57] it didn't page anyone [18:43:12] I just happened to have the page open :) [18:43:39] is it at all on your roadmap to make our monitoring better? [18:43:49] paravoid: ah cool. I've been logging it in the SAL when I do one. [18:43:51] having one alert per server is a bit too noisy [18:43:52] awight: https://bugzilla.wikimedia.org/show_bug.cgi?id=59780#c6 has some explanation [18:43:56] yeah SAL works for me :) [18:44:03] anomie: great, thx [18:45:07] <^d> paravoid: Yeah, the check is a tad noisy :p [18:45:09] paravoid: _joe_ improved it a bit I believe but we're still in that one alert per server thing. its a pain but I'm not really sure what the right way to fix it is. we tried cooking something up that could monitor _cirrus_ but it was too complex for ops. [18:45:47] anomie: ok that makes sense, the Title thing was snafu, but what would you say about a straight db query from bits to meta? This is a reasonable thing to do, and much more efficient than an intracluster API call? [18:45:52] maybe we could make an api call that let cirrus spit out what it knows about the status of the cluster. 
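[Editor's note: the single cluster-wide check being discussed here — one check against elasticsearch's cluster health instead of one noisy check per server — could be sketched like this. The green/yellow/red mapping is the conventional Nagios convention visible in the check output quoted in this log; the real production check also reports node and shard counts.]

```python
def check_cluster_health(health):
    """Map a parsed elasticsearch /_cluster/health response to a
    Nagios-style (exit code, label) pair.

    One such check run from the monitoring server covers the whole
    cluster, instead of a per-node alert for every restart.
    """
    status = health.get('status')
    if status == 'green':
        return 0, 'OK'
    if status == 'yellow':
        return 1, 'WARNING'
    return 2, 'CRITICAL'  # red, or no status at all
```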
so checking if it is up/down could just be hitting the api with ?nagios=true or something [18:46:31] <^d> manybubbles: problem with that means our check fails if api is down but search isn't :) [18:46:37] manybubbles: no, we shouldn't monitor ES via mediawiki [18:46:46] awight: A straight DB query is fine, IMO, as long as you're not caring to support setups that don't involve the local wiki being able to directly query the central wiki's DB. [18:47:28] (or if you have two ways to fetch the data) [18:47:39] anomie: good point, that would be nasty for some 3rd parties [18:48:00] paravoid: thats the trouble - if you want good feedback about what state _Cirrus_ is in (not elasticsearch) then you have to do complex stuff which is best suited to being run in cirrus itself. If you want to monitor the elasticsearch cluster you'll get all kinds of false positives about cirrus [18:48:23] awight: Fetch-via-API and direct-DB-query controlled by a config variable of some sort seems to be a more-or-less common pattern. [18:48:32] I'm not against a separate Cirrus check [18:48:40] but it's useful to know if the elasticsearch cluster by itself works or not [18:48:43] anomie: rad. we just have to remember to keep both code paths maintained and tested. [18:48:58] <^d> paravoid: Yeah, especially since we have non-Cirrus things using it :) [18:49:00] awight: Unit tests! (: [18:49:34] hehe /me tosses some salt over shoulder and skips off carefree, whistling. [18:51:08] paravoid: yeah - maybe we should make those checks less noisy. only one report for the whole cluster would be great. is that something icinga can do? [18:51:39] of course, it's a matter of configuration [18:51:54] you can e.g. 
have a single (remote) check from the icinga server [18:52:18] (03PS3) 10Ori.livneh: memcached: tidy [puppet] - 10https://gerrit.wikimedia.org/r/171153 [18:53:27] I think last time this was proposed there were some concerns about a split brain, though [18:56:41] manybubbles: if we were looking at latencies as observed by cirrus, would that be a good proxy? [18:56:57] it is essentially what users experience I think? [18:57:18] godog: only for cirrus. phab uses that cluster as well. so does translate iirc. but its a good start [18:57:40] (03PS1) 10Ori.livneh: Revert "hhvm: remove jemalloc profiling config due to a bug in HHVM" [puppet] - 10https://gerrit.wikimedia.org/r/172304 [18:57:42] <^d> translate will next week when we deploy :) [18:57:57] sure, we care so much more about search from wiki users tho [19:04:06] (03PS1) 10Ori.livneh: wmflib: make require_package() accept arrays [puppet] - 10https://gerrit.wikimedia.org/r/172305 [19:04:14] godog: ^ [19:05:12] (03PS2) 10Ori.livneh: wmflib: make require_package() accept arrays [puppet] - 10https://gerrit.wikimedia.org/r/172305 [19:05:44] ori: hah! you might want to add the wrong documentation that I removed too [19:07:18] (03PS3) 10Ori.livneh: wmflib: make require_package() accept arrays [puppet] - 10https://gerrit.wikimedia.org/r/172305 [19:08:13] godog: done. nothing to do with the arity, btw -- negative arities mean 'or more', and they're offset by one to allow '-1' to mean '0 or more'. so '-2' is '1 or more arguments' [19:09:31] ori: yeah I jumped the gun :( thanks for the proper fix :)) [19:09:50] np at all! 
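[Editor's note: ori's explanation of Puppet function arities fits in a couple of lines. This is just a decoder for the convention he describes, written in Python for illustration rather than Puppet's Ruby API.]

```python
def min_args(arity):
    """Decode the Puppet function arity convention described above.

    A non-negative arity is an exact argument count. Negative arities
    mean "or more", offset by one so that -1 can mean "0 or more":
    -1 -> at least 0 args, -2 -> at least 1 arg, and so on.
    """
    if arity >= 0:
        return arity   # exact number of arguments expected
    return -arity - 1  # minimum number of arguments, rest optional
```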
thanks [19:11:23] forgot to do that earlier [19:22:28] (03PS1) 10Dereckson: Gerrit also listens on port 22 [puppet] - 10https://gerrit.wikimedia.org/r/172313 (https://bugzilla.wikimedia.org/35611) [19:24:22] (03CR) 10Dereckson: [C: 04-1] "Requires first ybertium sshd doesn't listen on 208.80.154.81" [puppet] - 10https://gerrit.wikimedia.org/r/172313 (https://bugzilla.wikimedia.org/35611) (owner: 10Dereckson) [19:25:18] http://status.wikimedia.org/ -- dns has been 'slow' for a week.. ? [19:25:53] bblack: ^ [19:25:58] Have you tried turning it off and on again? [19:28:13] (03PS1) 10Ori.livneh: Add https://github.com/facebook/hhvm/pull/4199 [debs/hhvm] - 10https://gerrit.wikimedia.org/r/172314 [19:32:20] PROBLEM - puppet last run on tungsten is CRITICAL: CRITICAL: puppet fail [19:34:14] hmm [19:34:49] PROBLEM - ElasticSearch health check on elastic1004 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.111 [19:35:37] (03PS1) 10Ori.livneh: Add [debs/hhvm] (master_330) - 10https://gerrit.wikimedia.org/r/172315 [19:36:18] (03Abandoned) 10Ori.livneh: Add https://github.com/facebook/hhvm/pull/4199 [debs/hhvm] - 10https://gerrit.wikimedia.org/r/172314 (owner: 10Ori.livneh) [19:36:57] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] Add [debs/hhvm] (master_330) - 10https://gerrit.wikimedia.org/r/172315 (owner: 10Ori.livneh) [19:37:30] RECOVERY - puppet last run on tungsten is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [19:37:49] ^ transient failure, I re ran and things ar eok [19:37:50] *are [19:39:13] hmm [19:39:13] Error: Failed to apply catalog: Could not find dependency Package[graphite-web] for Exec[graphite_syncdb] at /etc/puppet/modules/graphite/manifests/web.pp:89 [19:40:00] <_joe_> godog: ^^ transiently so [19:40:11] <_joe_> :/ [19:40:35] <_joe_> I should take a better look at the require-package function and puppet internals [19:42:00] <_joe_> also ori may be interested [19:42:36] hrm, i'll look [19:46:44] <_joe_> ori: 
require_package sometimes doesn't work with require => Package['x'] [19:46:56] <_joe_> I think it's a dumbass race condition in the puppet core [19:47:27] _joe_: i gotta run for 20 mins or so, is it breaking or can i look at it then? [19:48:03] <_joe_> no you have all the time you want :) [19:50:37] cajoel: I don't think that DNS latency in nimsoft is real, it's probably a monitoring issue. [19:51:36] does this imply that it's testing zone transfers? http://status.wikimedia.org/8777/155942/DNS [19:51:43] 'transfer' ? [19:53:32] I have no idea, but we don't do zone transfers (and haven't for a very long time) [19:54:53] who knows the nimsoft setup? [20:00:22] cajoel: nimsoft wasn't updated for the ns1 IP change that happened a while back, so the "latency" was it timing out and failing on 1/3 server IPs (which seems like a dumb way to structure the check, we probably want it to turn red if it can't hit a DNS server at all). [20:00:28] but in any case, I updated the wrong IP [20:00:42] sweet [20:02:20] ori _joe_ thanks! [20:06:52] RECOVERY - ElasticSearch health check on elastic1004 is OK: OK - elasticsearch (production-search-eqiad) is running. status: green: timed_out: false: number_of_nodes: 31: number_of_data_nodes: 31: active_primary_shards: 2117: active_shards: 6367: relocating_shards: 16: initializing_shards: 0: unassigned_shards: 0 [20:12:11] mutante: greg-g hexmode do you know who is responsible for the bugzilla reporter/stats email that's sent every week? [20:12:50] YuviPanda: i do, it's a script and cron, puppetized [20:13:17] mutante: ah, ok. can you silence the script? 
sends out cronspam with a few hundred 'progress' lines for every successful run :) [20:14:25] (03CR) 10Faidon Liambotis: [C: 04-1] "One small inline comment and one larger note that I think I've asked before and forgot: it seems to me like a great deal of work has gone " (031 comment) [puppet/cassandra] - 10https://gerrit.wikimedia.org/r/166888 (owner: 10Ottomata) [20:14:26] YuviPanda: eh, let's fix it instead of silencing it [20:14:43] YuviPanda: where do you see them? [20:14:45] mutante: sure, we can just not make it output things when things don't error [20:15:08] mutante: forwarding [20:15:20] mutante: forwarded [20:16:04] mutante: hmm, gmail is still at 'working' [20:16:45] YuviPanda: i see the normal weekly reporter script i keep getting [20:16:50] did it come from labs ? [20:16:58] the prod. one comes from zirconium [20:17:05] mutante: no, it came from zirconium [20:17:09] mutante: it just came from cron [20:17:10] sent to root [20:17:31] mutante: ok, it finished sending you the message. was huge [20:17:44] oooh, _that_ one ? [20:17:46] Cron cd /srv/org/wikimedia/bugzilla; ./collectstats.pl --regenerate [20:17:48] yeah [20:17:52] that's a different one, i get it now [20:17:58] aaah [20:18:01] it isn't the reporter? [20:18:22] i was talking about "Bugzilla Weekly Report" from reporter@ [20:18:38] nope, it's a different cron.. 
i'll fix it [20:18:46] ah, cool [20:18:48] thanks, mutante [20:19:03] np [20:19:28] (03PS1) 10Legoktm: [WIP] Deploy BounceHandler extension to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172322 (https://bugzilla.wikimedia.org/69019) [20:21:31] yea, it's actually not failing, legit script run, just that we don't need to have the output via mail really [20:22:24] (03PS2) 10Hashar: tox env to build test coverage [debs/pybal] - 10https://gerrit.wikimedia.org/r/172243 [20:23:44] yeah [20:23:54] mutante: plus it's a *large* email [20:24:04] mutante: and cleaning up cronspam means more opsen can read it :) [20:24:29] (03PS1) 10Dzahn: bugzilla - silence collectstats cron jobs [puppet] - 10https://gerrit.wikimedia.org/r/172324 [20:24:36] (03CR) 10jenkins-bot: [V: 04-1] bugzilla - silence collectstats cron jobs [puppet] - 10https://gerrit.wikimedia.org/r/172324 (owner: 10Dzahn) [20:24:41] yes, YuviPanda, it makes much sense to reduce cron spam,, there, silencing it [20:24:54] mutante: :D cool [20:25:05] (03PS2) 10Dzahn: bugzilla - silence collectstats cron jobs [puppet] - 10https://gerrit.wikimedia.org/r/172324 [20:25:28] (03CR) 10Yuvipanda: [C: 031] bugzilla - silence collectstats cron jobs [puppet] - 10https://gerrit.wikimedia.org/r/172324 (owner: 10Dzahn) [20:25:37] mutante: hmm, should we only silence output? [20:25:43] mutante: errors should probably make it through [20:25:50] (03CR) 10Reedy: [WIP] Deploy BounceHandler extension to production (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172322 (https://bugzilla.wikimedia.org/69019) (owner: 10Legoktm) [20:26:08] YuviPanda: .. we don't care anymore [20:26:14] hehe, ok then [20:27:35] YuviPanda: there's a bonus comment in the BZ docs [20:27:38] "You normally don't need to use this option (do not use it in a cron job)." 
[20:27:45] hehe [20:27:52] but it was there since forever, i just puppetized manual things at some point [20:28:09] http://www.bugzilla.org/docs/4.4/en/html/api/collectstats.html [20:28:23] i think the only consumer of this data was qgil [20:28:46] (03CR) 10Ori.livneh: [C: 031] "LGTM" [debs/pybal] - 10https://gerrit.wikimedia.org/r/172243 (owner: 10Hashar) [20:29:16] hehe [20:29:31] (03CR) 10Reedy: [WIP] Deploy BounceHandler extension to production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172322 (https://bugzilla.wikimedia.org/69019) (owner: 10Legoktm) [20:30:00] (03CR) 10Ori.livneh: [C: 032] Revert "hhvm: remove jemalloc profiling config due to a bug in HHVM" [puppet] - 10https://gerrit.wikimedia.org/r/172304 (owner: 10Ori.livneh) [20:30:38] (03CR) 10Yuvipanda: [C: 032] bugzilla - silence collectstats cron jobs [puppet] - 10https://gerrit.wikimedia.org/r/172324 (owner: 10Dzahn) [20:40:22] Reedy: can you please review https://gerrit.wikimedia.org/r/#/c/172112/ ? [20:50:13] Reedy: shouldn't the BH config have the IP of the mailserver, not the app servers? [20:50:35] Oh.. [20:50:40] (03PS1) 10Yuvipanda: Pass in config object to generators [software/shinkengen] - 10https://gerrit.wikimedia.org/r/172335 [20:50:46] polonium's IP, I guess [20:51:12] yeah.. [20:54:07] 208.80.154.90 [20:54:59] (03PS2) 10Yuvipanda: Pass in config object to generators [software/shinkengen] - 10https://gerrit.wikimedia.org/r/172335 [20:58:22] (03CR) 10Hashar: "*hint* https://www.mediawiki.org/wiki/Continuous_integration/Tutorials/Test_your_python *hint*" [software/shinkengen] - 10https://gerrit.wikimedia.org/r/172335 (owner: 10Yuvipanda) [20:58:33] hashar: :) will do [20:58:43] YuviPanda: :D [20:58:45] well, will at least setup linting :) [21:00:05] gwicke, cscott, subbu: Respected human, time to deploy Parsoid/OCG (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20141110T2100). Please do the needful. 
[21:02:33] Reedy: and wikishared is on extension1 right? [21:02:38] ya [21:03:27] (03PS2) 10Legoktm: [WIP] Deploy BounceHandler extension to production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/172322 (https://bugzilla.wikimedia.org/69019) [21:03:36] ok, just needs the thing to be set in PrivateSettings now [21:04:38] paravoid: in 3183df9602582ddd43a672e9f54c143932ad7ae8 you set python-memcache to ensure=>absent. Safe to assume that was cleanup and not that that package actually needs to be missing for things to work properly? [21:04:46] (I realize that patch is from ages ago; sorry) [21:05:19] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [500.0] [21:13:21] (03PS3) 10Ori.livneh: hhvm: ensure that jemalloc heap profiling is disabled. [puppet] - 10https://gerrit.wikimedia.org/r/171515 [21:13:21] arlolra and i are doing the Parsoid deploy today [21:13:30] (03CR) 10Reedy: mediawiki: simplify apache config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/170300 (owner: 10Giuseppe Lavagetto) [21:14:30] PROBLEM - HTTP error ratio anomaly detection on tungsten is CRITICAL: CRITICAL: Anomaly detected: 10 data above and 9 below the confidence bounds [21:14:47] there was an exception spike [21:15:41] RECOVERY - HTTP error ratio anomaly detection on tungsten is OK: OK: No anomaly detected [21:16:04] (03PS4) 10Ori.livneh: hhvm: ensure that jemalloc heap profiling is disabled. 
[puppet] - 10https://gerrit.wikimedia.org/r/171515 [21:19:12] (03CR) 10Ori.livneh: [C: 032 V: 032] "By being made conditional on a check of jemalloc-stats-print, this doesn't need to run on a schedule, but only when the server state devia" [puppet] - 10https://gerrit.wikimedia.org/r/171515 (owner: 10Ori.livneh) [21:21:23] (03CR) 10Yuvipanda: [C: 032 V: 032] Pass in config object to generators [software/shinkengen] - 10https://gerrit.wikimedia.org/r/172335 (owner: 10Yuvipanda) [21:21:35] (03PS1) 10Ori.livneh: Fix-up for I4f3534ea2 [puppet] - 10https://gerrit.wikimedia.org/r/172405 [21:21:39] (03PS1) 10Ottomata: Add new ssh key for Otto's new laptop [puppet] - 10https://gerrit.wikimedia.org/r/172406 [21:21:42] (03PS2) 10Ori.livneh: Fix-up for I4f3534ea2 [puppet] - 10https://gerrit.wikimedia.org/r/172405 [21:21:48] (03CR) 10Ori.livneh: [C: 032 V: 032] Fix-up for I4f3534ea2 [puppet] - 10https://gerrit.wikimedia.org/r/172405 (owner: 10Ori.livneh) [21:22:04] (03PS2) 10Ottomata: Add new ssh key for Otto's new laptop [puppet] - 10https://gerrit.wikimedia.org/r/172406 [21:23:29] (03CR) 10Ottomata: [C: 032] Add new ssh key for Otto's new laptop [puppet] - 10https://gerrit.wikimedia.org/r/172406 (owner: 10Ottomata) [21:23:46] ottomata: want me to puppet merge? [21:25:19] i got my other computer open next to me, so i think i got it [21:25:52] thanks YuviPanda [21:25:57] ottomata: aah, cool :) [21:29:10] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [21:30:40] PROBLEM - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:30:44] PROBLEM - check configured eth on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:30:47] PROBLEM - puppet last run on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[21:30:49] PROBLEM - SSH on stat1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:31:08] (03CR) 10Dzahn: [C: 031] "magnesium already has an IPv6 address on eth0 (besides link-local), just not a mapped one, so makes sense" [puppet] - 10https://gerrit.wikimedia.org/r/172179 (owner: 10John F. Lewis) [21:31:11] PROBLEM - RAID on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:31:20] PROBLEM - DPKG on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:31:39] PROBLEM - check if salt-minion is running on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:31:41] PROBLEM - check if dhclient is running on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:31:41] ottomata: ^ .. uhm.. it looks down [21:31:44] trying mgmt [21:32:13] it looks like something yeah [21:32:15] i was about to do that [21:32:21] discussion in #wikimedia-research [21:32:24] about folks hammering the machine [21:32:28] it might just be 100% loaded up [21:32:29] [2263326.770105] [21:32:33] the only output i see [21:32:38] hm: i-000005d2.eqiad.wmflabs is not completing its git-deploy fetch [21:32:41] looks frozen [21:32:47] any git-deploy gurus who can help? [21:32:55] hm [21:32:58] dunno [21:33:08] ottomata: that state makes me want to powercycle, nothing on mgmt [21:33:14] no reaction [21:33:29] Ironholds: ^ [21:33:38] yeah, 100% CPU https://graphite.wikimedia.org/render/?width=586&height=308&_salt=1415655208.739&target=servers.stat1002.cpu.total.user.value&target=servers.stat1002.cpu.total.system.value [21:33:48] for close to 12h? [21:34:19] yea, so there would still be a login on mgmt though [21:34:33] so, I actually managed through some chicken-sacrifice voodoo to SSH in and top [21:34:50] main culprits appear to be an infinite set of turtle-I mean, processes, from ellery, EZ and "stats", whatever that is. [21:35:11] you can't actually kill them, can you? don't have the rights, I think? 
[21:35:14] stats.wikimedia.org is an alias for stat1001.wikimedia.org. [21:35:18] not 1002 [21:35:34] YuviPanda, well, more importantly even if I could, I can't get into the machine any more [21:35:39] heh [21:35:54] i can't either, also not on drac, it needs cycling [21:36:01] <^d> I see an Ironholds on IRC!! [21:36:11] so, other than ewulczyn's kafkacat stuff, we /might/ lose the output of EZ's scripts for a day [21:36:22] but we can email him and tell him "this broke while one of your things was running, go rerun it". [21:36:25] ^d: someone from research broke stat1002, Ironholds is from research, and hence he broke it! :) [21:36:29] (and if it breaks again at least we know what happened) [21:36:42] ^d, I'm on IRC all the time, just not -staff. [21:36:48] !log powercycling frozen stat1002 [21:36:55] Logged the message, Master [21:36:57] <^d> Ironholds: I /whois you but your list of channels is always empty for me [21:37:21] Which only means you don't share channels [21:37:24] Configuring memory. Please wait... [21:37:36] it is coming back to life slowly [21:37:41] mutante/ottomata: [21:37:46] <^d> Ironholds: I have a youtube video for you! https://www.youtube.com/watch?v=facVh75-vW4 [21:37:49] feel free to kill my processes if you are able to [21:37:52] no, stats is a user [21:37:55] on the stats machines [21:37:55] I'm gonna send an email to stat1002 noting "PLEASE FOR THE LOVE OF GOD TELL US IF YOU'RE GOING TO DO THIS" [21:37:56] <^d> (it's very important you watch that video Ironholds) [21:38:06] system user, it runs cron jobs, etc. [21:38:06] if you want to send a detailed "and here's what happened" when we know? [21:38:10] ^d: show up at #wikimedia-research for all your Ironholds needs [21:38:12] ottomata, aha. [21:38:22] <^d> YuviPanda: I have enough channels as it is! [21:38:29] ^d: clearly not [21:38:32] if you thought Ironholds was off IRC :) [21:38:45] <^d> Ironholds is just in the wrong channels! 
[21:39:47] (03PS1) 10Jgreen: attempt to isolate and squelch harmless warning in OTRS ticket-mail export function [puppet] - 10https://gerrit.wikimedia.org/r/172411 [21:39:55] I'm in the right channels. Anyway, this is an irrelevancy. [21:40:01] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 21 minutes ago with 0 failures [21:40:06] Ironholds: @stat1002:~# uptime 21:40:00 up 0 min [21:40:12] RECOVERY - SSH on stat1002 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [21:40:19] yay! [21:40:32] RECOVERY - RAID on stat1002 is OK: OK: optimal, 1 logical, 12 physical [21:40:37] <^d> Ironholds: The video I shared is not irrelevant. [21:40:40] <^d> :D [21:40:46] okay. I'll start an email thread about the "please tell us what you're doing when it's high-throughput" [21:40:50] RECOVERY - DPKG on stat1002 is OK: All packages OK [21:40:51] RECOVERY - check if dhclient is running on stat1002 is OK: PROCS OK: 0 processes with command name dhclient [21:40:51] RECOVERY - check if salt-minion is running on stat1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [21:40:57] would suggest a general incident report at least for analytics-internal so we can work out what happened and how to stop it. [21:41:00] RECOVERY - check configured eth on stat1002 is OK: NRPE: Unable to read output [21:41:00] RECOVERY - Disk space on stat1002 is OK: DISK OK [21:41:35] (03CR) 10Jgreen: [C: 032 V: 031] attempt to isolate and squelch harmless warning in OTRS ticket-mail export function [puppet] - 10https://gerrit.wikimedia.org/r/172411 (owner: 10Jgreen) [21:43:42] (03PS2) 10Reedy: $wgContentHandlerUseDB true everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/170129 (https://bugzilla.wikimedia.org/49193) (owner: 10Spage) [21:43:51] <^d> Ironholds: Have you watched that video yet? I thought of you especially when I saw it :) [21:44:43] ^d, I have! 
[21:44:51] I just wanted to hop in to confirm that I'm down for a stat1003 power cycle. [21:44:56] I won't be affected. [21:45:05] <^d> Ironholds: I miss John Madden :p [21:45:20] halfak: 1002! :) already happened [21:45:32] OK. Still good. [21:45:35] :) [21:46:40] so, i think it was this, right [21:46:52] import_record_impression.py [21:47:07] mutante: thanks [21:47:31] (03PS11) 10Andrew Bogott: Add class and role for Openstack Horizon [puppet] - 10https://gerrit.wikimedia.org/r/170340 [21:47:33] (03PS1) 10Andrew Bogott: No longer ensure => absent package python-memcache [puppet] - 10https://gerrit.wikimedia.org/r/172413 [21:47:44] ottomata: yw, @stat1002:~# grep import_record_impression /var/log/syslog [21:48:12] (03PS1) 10Yuvipanda: Put project name in 'notes' field of host [software/shinkengen] - 10https://gerrit.wikimedia.org/r/172414 [21:49:47] !log updated OCG to version d9855961b18f550f62c0b20da70f95847a215805 [21:49:50] Logged the message, Master [21:53:23] mutante, email sent to analytics-internal and research-internal, proposing a few guidelines ("maximum of 4 cores", "tell people first", "no nesting parallelisation") [21:53:29] hopefully that will help? Sorry about this! [21:54:19] thanks Ironholds [21:54:32] Ironholds: thank you, sounds good. so i basically just powercycled it, but there wasn't much choice, mgmt did not give me a login [21:54:41] yup. It was not a happy bunny :( [21:55:06] alright, I'mma retreat to my research cave. I have to copyedit "User sessions identification based on strong regularities in inter-activity time" which I promise is more fun than it sounds [21:55:07] * Ironholds waves [22:01:35] (03CR) 10Dzahn: [C: 032] OCG configuration: turn off now-unnecessary transition hack. [puppet] - 10https://gerrit.wikimedia.org/r/171579 (owner: 10Cscott) [22:05:08] paravoid: yt? 
[22:08:39] (03PS2) 10Yuvipanda: Put project name in 'notes' field of host [software/shinkengen] - 10https://gerrit.wikimedia.org/r/172414 [22:08:48] (03CR) 10Yuvipanda: [C: 032 V: 032] Put project name in 'notes' field of host [software/shinkengen] - 10https://gerrit.wikimedia.org/r/172414 (owner: 10Yuvipanda) [22:11:03] (03CR) 10Ottomata: Link aggregator dataset into wikimetrics public webspace (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/172285 (https://bugzilla.wikimedia.org/72740) (owner: 10QChris) [22:12:20] PROBLEM - ElasticSearch health check on elastic1006 is CRITICAL: CRITICAL - Could not connect to server 10.64.0.113 [22:12:49] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [22:12:54] (03CR) 10Dzahn: [C: 032] "magnesium already has an IPv6 address on eth0 (besides link-local), just not a mapped one, so makes sense. better mapped than not. no actu" [puppet] - 10https://gerrit.wikimedia.org/r/172179 (owner: 10John F. Lewis) [22:13:58] !log updated Parsoid to version b61475196 [22:14:02] Logged the message, Master [22:15:17] (03CR) 10Dzahn: "inet6 2620:0:861:1:208:80:154:5/64" [puppet] - 10https://gerrit.wikimedia.org/r/172179 (owner: 10John F. Lewis) [22:15:48] YuviPanda: Could not generate documentation: Definition 'monitoring::group' is already defined at /srv/org/wikimedia/doc/puppetsource/modules/monitoring/manifests/group.pp:13; cannot be redefined at /srv/org/wikimedia/doc/puppetsource/manifests/nagios.pp:139 [22:16:01] that one makes it so that [22:16:03] operations-puppet-doc FAILURE in 35s [22:16:22] hmm, I think _joe_ did the moving...? [22:16:28] (03PS1) 10Ori.livneh: Allow multiple instances [puppet/varnishkafka] - 10https://gerrit.wikimedia.org/r/172418 [22:18:03] modules/monitoring/ ? vs. modules/nagios_common/ [22:19:09] mutante: hmm, one of them should die. 
[22:19:10] monitoring, nagios_common, icinga, shinken [22:19:12] I'm unsure which [22:20:14] also on the subject of killing modules... [22:20:14] https://gerrit.wikimedia.org/r/#/c/170974/ [22:20:36] renames nagios_common to icinga_common [22:21:02] well, icinga_shinken_common :) [22:21:25] shinken -> Schinken [22:21:34] http://en.wiktionary.org/wiki/Schinken [22:22:46] +2 [22:23:21] :D [22:24:34] NO [22:24:36] haha [22:25:30] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [22:25:34] (03PS1) 10Yuvipanda: [WIP] shinken: Add basic service checks for all of labs [puppet] - 10https://gerrit.wikimedia.org/r/172420 [22:26:17] (03CR) 10jenkins-bot: [V: 04-1] [WIP] shinken: Add basic service checks for all of labs [puppet] - 10https://gerrit.wikimedia.org/r/172420 (owner: 10Yuvipanda) [22:27:08] (03PS2) 10Yuvipanda: [WIP] shinken: Add basic service checks for all of labs [puppet] - 10https://gerrit.wikimedia.org/r/172420 [22:29:36] (03PS2) 10QChris: Link aggregator dataset into wikimetrics public webspace [puppet] - 10https://gerrit.wikimedia.org/r/172285 (https://bugzilla.wikimedia.org/72740) [22:30:26] (03CR) 10QChris: Link aggregator dataset into wikimetrics public webspace (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/172285 (https://bugzilla.wikimedia.org/72740) (owner: 10QChris) [22:30:50] (03PS3) 10Yuvipanda: [WIP] shinken: Add basic service checks for all of labs [puppet] - 10https://gerrit.wikimedia.org/r/172420 [22:32:36] (03CR) 10QChris: Link aggregator dataset into wikimetrics public webspace (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/172285 (https://bugzilla.wikimedia.org/72740) (owner: 10QChris) [22:37:00] (03PS4) 10Yuvipanda: [WIP] shinken: Add basic service checks for all of labs [puppet] - 10https://gerrit.wikimedia.org/r/172420 [22:38:44] (03PS5) 10Yuvipanda: [WIP] shinken: Add basic service checks for all of labs [puppet] - 10https://gerrit.wikimedia.org/r/172420 [22:38:46] 
(03PS1) 10Yuvipanda: shinken: Fix typo in previous commit fixing typo [puppet] - 10https://gerrit.wikimedia.org/r/172423 [22:52:01] (03PS6) 10Yuvipanda: [WIP] shinken: Add basic service checks for all of labs [puppet] - 10https://gerrit.wikimedia.org/r/172420 [22:52:43] (03CR) 10Yuvipanda: [C: 032] shinken: Fix typo in previous commit fixing typo [puppet] - 10https://gerrit.wikimedia.org/r/172423 (owner: 10Yuvipanda) [22:53:07] (03CR) 10QChris: Link aggregator dataset into wikimetrics public webspace (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/172285 (https://bugzilla.wikimedia.org/72740) (owner: 10QChris) [22:56:49] (03PS1) 10John F. Lewis: icinga-admin cname to neon [dns] - 10https://gerrit.wikimedia.org/r/172430 [22:57:06] mutante: ^^ [22:58:48] (03PS7) 10Yuvipanda: [WIP] shinken: Add basic service checks for all of labs [puppet] - 10https://gerrit.wikimedia.org/r/172420 [23:00:39] (03CR) 10Ottomata: Link aggregator dataset into wikimetrics public webspace (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/172285 (https://bugzilla.wikimedia.org/72740) (owner: 10QChris) [23:04:21] (03CR) 10QChris: [C: 04-1] "ottomata and I discussed this change in IRC [1]." [puppet] - 10https://gerrit.wikimedia.org/r/172201 (https://bugzilla.wikimedia.org/72740) (owner: 10QChris) [23:05:46] (03PS1) 10John F. Lewis: map ip6 on uranium [puppet] - 10https://gerrit.wikimedia.org/r/172432 [23:05:56] (03PS2) 10John F. Lewis: map ip6 on uranium [puppet] - 10https://gerrit.wikimedia.org/r/172432 [23:06:24] (03CR) 10Dzahn: [C: 032] "that's true, no reason for icinga-admin to be different from icinga" [dns] - 10https://gerrit.wikimedia.org/r/172430 (owner: 10John F. Lewis) [23:06:47] mutante: thanks - mind looking at https://gerrit.wikimedia.org/r/#/c/172432/ :) [23:06:56] another map ip6 change [23:07:59] (03CR) 10Dzahn: "icinga-admin.wikimedia.org is an alias for neon.wikimedia.org. 
ping6 icinga-admin.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/172430 (owner: 10John F. Lewis) [23:13:08] (03CR) 10Dzahn: [C: 032] "yes, this also already has an IPv6 on eth0, just not a mapped one, so better mapped than not mapped, no actual AAAA record for uranium yet" [puppet] - 10https://gerrit.wikimedia.org/r/172432 (owner: 10John F. Lewis) [23:13:25] (03PS1) 10Ori.livneh: hhvm::debug: add apache2-utils [puppet] - 10https://gerrit.wikimedia.org/r/172433 [23:13:50] (03CR) 10Dzahn: "uranium could also use some base::firewall, btw" [puppet] - 10https://gerrit.wikimedia.org/r/172432 (owner: 10John F. Lewis) [23:15:40] PROBLEM - ElasticSearch health check on elastic1007 is CRITICAL: CRITICAL - Could not connect to server 10.64.32.139 [23:17:50] PROBLEM - ElasticSearch health check for shards on elastic1007 is CRITICAL: CRITICAL - elasticsearch http://10.64.32.139:9200/_cluster/health error while fetching: Request timed out. [23:17:51] (03PS1) 10John F. Lewis: include firewall on uranium [puppet] - 10https://gerrit.wikimedia.org/r/172434 [23:18:02] (03PS2) 10John F. 
Lewis: include firewall on uranium [puppet] - 10https://gerrit.wikimedia.org/r/172434 [23:18:12] mutante: ^^ :) [23:18:47] BREAK ALL THE THINGS [23:18:49] RECOVERY - ElasticSearch health check for shards on elastic1007 is OK: OK - elasticsearch status production-search-eqiad: status: yellow, number_of_nodes: 31, unassigned_shards: 209, timed_out: False, active_primary_shards: 2117, cluster_name: production-search-eqiad, relocating_shards: 16, active_shards: 6158, initializing_shards: 0, number_of_data_nodes: 31 [23:18:59] Reedy :p [23:22:33] !log reprepro: include src:libmaxminddb, src:geoipupdate for precise/trusty [23:22:38] Logged the message, Master [23:31:17] (03PS2) 10Andrew Bogott: No longer ensure => absent package python-memcache [puppet] - 10https://gerrit.wikimedia.org/r/172413 [23:31:19] (03PS12) 10Andrew Bogott: Add class and role for Openstack Horizon [puppet] - 10https://gerrit.wikimedia.org/r/170340 [23:32:28] (03CR) 10Dzahn: [C: 04-1] "first need to make sure there are holes in the firewall for ganglia and whatever else might be on the host" [puppet] - 10https://gerrit.wikimedia.org/r/172434 (owner: 10John F. Lewis) [23:36:31] (03CR) 10Dzahn: "inet6 2620:0:861:1:208:80:154:53/64" [puppet] - 10https://gerrit.wikimedia.org/r/172432 (owner: 10John F. Lewis) [23:41:49] PROBLEM - puppet last run on mw1013 is CRITICAL: CRITICAL: Puppet has 1 failures [23:42:00] PROBLEM - puppet last run on mw1096 is CRITICAL: CRITICAL: Puppet has 2 failures [23:43:12] PROBLEM - puppet last run on mw1109 is CRITICAL: CRITICAL: Puppet has 1 failures [23:45:50] RECOVERY - puppet last run on mw1013 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [23:48:20] (03PS1) 10John F. Lewis: add AAAA for uranium [dns] - 10https://gerrit.wikimedia.org/r/172442 [23:48:31] (03PS2) 10John F. 
Lewis: add AAAA for uranium [dns] - 10https://gerrit.wikimedia.org/r/172442 [23:56:16] (03PS1) 10Faidon Liambotis: geoip: switch data::maxmind to geoiupdate [puppet] - 10https://gerrit.wikimedia.org/r/172444 [23:56:39] (03PS2) 10Faidon Liambotis: geoip: switch data::maxmind to geoipupdate [puppet] - 10https://gerrit.wikimedia.org/r/172444 [23:58:14] oh man [23:58:45] <^d> ebernhardson: ping for swat. [23:58:45] <% if @proxy then %> [23:58:48] clearly too late... [23:58:58] I claim SWAT today [23:59:11] Because of the sheer number of VE changes [23:59:11] (03PS3) 10Faidon Liambotis: geoip: switch data::maxmind to geoipupdate [puppet] - 10https://gerrit.wikimedia.org/r/172444 [23:59:34] <^d> RoanKattouw: They're not on-wiki :) [23:59:38] Adding now [23:59:45] But I also need to update my commit to add more [23:59:53] <^d> But it's all yours if you're volunteering :)