[00:01:42] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 1703 bytes in 5.514 second response time
[00:02:39] cp1053 looks hosed
[00:08:42] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 139011 bytes in 5.750 second response time
[00:12:41] mark & paravoid, cp1053 looks borken
[00:13:00] bblack, ^
[00:34:52] RECOVERY - Varnish traffic logger on cp1053 is OK: PROCS OK: 2 processes with command name varnishncsa
[00:35:12] RECOVERY - Varnish HTTP text-backend on cp1053 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.000 second response time
[00:35:22] RECOVERY - Varnish HTCP daemon on cp1053 is OK: PROCS OK: 1 process with UID = 111 (vhtcpd), args vhtcpd
[01:08:33] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: reqstats.5xx [crit=500.000000
[02:01:12] PROBLEM - Puppet freshness on dysprosium is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 07:45:00 PM UTC
[02:07:33] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: reqstats.5xx [warn=250.000
[02:27:23] !log LocalisationUpdate completed (1.23wmf13) at 2014-02-17 02:27:22+00:00
[02:27:43] Logged the message, Master
[02:37:40] !log LocalisationUpdate completed (1.23wmf14) at 2014-02-17 02:37:40+00:00
[02:37:47] Logged the message, Master
[03:19:29] !log LocalisationUpdate ResourceLoader cache refresh completed at 2014-02-17 03:19:28+00:00
[03:19:35] Logged the message, Master
[03:19:45] ...
[03:20:05] that took 40 minutes?
[03:22:00] Hmm
[03:22:17] 40 minutes 2 days ago
[03:22:21] 20 minutes yesterday
[03:22:49] has been for a while :/
[05:02:12] PROBLEM - Puppet freshness on dysprosium is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 07:45:00 PM UTC
[05:15:21] (CR) Andrew Bogott: [C: 2] Add cron entries to update puppet repos on labs puppetmasters.
[operations/puppet] - https://gerrit.wikimedia.org/r/113332 (owner: Andrew Bogott)
[08:03:12] PROBLEM - Puppet freshness on dysprosium is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 07:45:00 PM UTC
[08:43:59] good morning
[08:44:14] morning hashar
[08:44:25] bah gotta reboot CPU hot again
[08:44:27] brb
[08:46:46] !log Upgrading Jenkins, half an hour downtime
[08:46:54] Logged the message, Master
[08:47:07] paravoid: ^^^^ :-D
[08:50:10] good luck
[08:50:22] PROBLEM - jenkins_service_running on gallium is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war
[08:51:28] ACKNOWLEDGEMENT - jenkins_service_running on gallium is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war ori.livneh hashar performing a scheduled upgrade
[08:53:57] ohh
[08:54:23] I have two jenkins running
[08:55:11] +4!
[08:55:23] (2 * +2)
[08:56:22] RECOVERY - jenkins_service_running on gallium is OK: PROCS OK: 1 process with regex args ^/usr/bin/java .*-jar /usr/share/jenkins/jenkins.war
[08:56:25] \O/
[09:03:22] nice
[09:09:38] Jenkins init script is borked somehow
[09:27:16] hashar: ?
[09:29:21] ori: the init script does not properly keep track of the process PID iirc
[09:29:43] I filed some bugs about it, but since it is merely annoying I haven't looked at it closely
[09:31:26] hashar: PIDFILE=/var/run/jenkins/jenkins.pid
[09:31:36] # ls /var/run/jenkins/jenkins.pid
[09:31:36] ls: cannot access /var/run/jenkins/jenkins.pid: No such file or directory
[09:33:10] gotta reproduce that in labs one day
[09:48:35] hashar: is that the package maintainer's init script?
[09:48:44] it's a bit bizarre that it's using 'daemon' rather than 'start-stop-daemon'
[09:49:22] we grab upstream debian package
[09:49:44] so it is probably the init script in Jenkins sources
[09:51:33] * odder waves at hashar and ori
[09:51:45] hi odder
[09:57:46] Guten Tag
[10:03:56] morgen
[10:19:06] mogge
[11:04:12] PROBLEM - Puppet freshness on dysprosium is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 07:45:00 PM UTC
[12:20:40] (PS2) Siebrand: Ignore PhpStorm files [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113531
[12:20:45] (CR) Hashar: [C: 2] Ignore PhpStorm files [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113531 (owner: Siebrand)
[12:20:52] (Merged) jenkins-bot: Ignore PhpStorm files [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113531 (owner: Siebrand)
[12:31:26] (CR) Hashar: "The piuparts issue at http://integration.wikimedia.org/ci/job/operations-debs-git-fat-debian-glue/3/tapResults/?" [operations/debs/git-fat] (debian) - https://gerrit.wikimedia.org/r/113018 (owner: Ottomata)
[13:37:31] (CR) Petrb: [C: -1] "I don't see why you removed that resolve helper, which now makes the script less useful. Also what if I wanted to use -v as a first parame" [operations/puppet] - https://gerrit.wikimedia.org/r/113755 (owner: Hoo man)
[13:45:08] (PS1) Ori.livneh: gdash: add monthly graphs to 'frontend' dashboard [operations/puppet] - https://gerrit.wikimedia.org/r/113774
[13:46:25] (CR) Ori.livneh: [C: 2] gdash: add monthly graphs to 'frontend' dashboard [operations/puppet] - https://gerrit.wikimedia.org/r/113774 (owner: Ori.livneh)
[13:46:29] (PS26) Matanya: site: lint [operations/puppet] - https://gerrit.wikimedia.org/r/109507
[13:53:35] (PS1) Matanya: bugzilla: remove files dir.
now templates in bugzilla module [operations/puppet] - https://gerrit.wikimedia.org/r/113775
[14:05:12] PROBLEM - Puppet freshness on dysprosium is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 07:45:00 PM UTC
[14:38:27] (PS1) coren: DNS: add labstore.svc.eqiad.wmnet [operations/dns] - https://gerrit.wikimedia.org/r/113778
[14:42:32] (PS2) coren: DNS: add labstore.svc.eqiad.wmnet [operations/dns] - https://gerrit.wikimedia.org/r/113778
[14:45:30] +2 for this? ^^ I don't like self-merge on DNS. :-)
[14:49:25] (CR) MaxSem: DNS: add labstore.svc.eqiad.wmnet (1 comment) [operations/dns] - https://gerrit.wikimedia.org/r/113778 (owner: coren)
[14:51:42] (CR) coren: DNS: add labstore.svc.eqiad.wmnet (1 comment) [operations/dns] - https://gerrit.wikimedia.org/r/113778 (owner: coren)
[14:52:20] MaxSem: Are we doing spaces in zone files now? I don't mind either way but right now it's a happy fun random mix of both.
[14:52:46] I actually had to turn off expandtabs to edit that file. :-)
[14:53:19] Coren, the block you were adding to had spaces
[14:53:23] (PS1) Matanya: appserver php: remove files moved to appserver module [operations/puppet] - https://gerrit.wikimedia.org/r/113784
[14:53:32] Ah. That's sensical.
[14:53:37] * Coren switches.
[14:54:18] that's why all code style switches have to be done in one huge commit altering whole repository:D
[14:54:20] (PS3) coren: DNS: add labstore.svc.eqiad.wmnet [operations/dns] - https://gerrit.wikimedia.org/r/113778
[14:58:52] Coren: mark said he didn't like the .svc. idea
[14:59:28] but I'll defer to him
[15:01:22] paravoid: can ops find some man power to review some of my stuff? i'm above 40 pending review. would you like me to pause? I fear i'm putting to much pressure
[15:01:32] *too
[15:01:34] I think akosiaris had promised in an ops meeting to help
[15:02:32] i can take a break if it will help
[15:03:02] yeah I got kind of stalled in that front.
Trying to finish up that catalog differ
[15:05:05] akosiaris: if it is too troublesome for you, just let me know. i'm not looking to add more work. i'm sure you are overloaded
[15:06:02] matanya: you are joking, right ? I ain't gonna tell you to stop working if you feel like it :-)
[15:06:57] no, i'm serious, if it adds load, with low added value, so no point in doing it, right?
[15:07:54] matanya: who said it is low added value ?
[15:08:07] * matanya did
[15:08:52] then he is lying to you :P
[15:09:51] :P
[15:18:02] Hm, for a quoted top-scope variable in puppet, is "${::dc}" proper syntax?
[15:18:22] yes
[15:18:42] and @dc in templates please
[15:19:56] (CR) Mark Bergsma: add salt grains automatically in system::role (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/107831 (owner: Dzahn)
[15:27:41] (PS1) Andrew Bogott: Fix some of the string-substitution magic in the labs vm build. [operations/puppet] - https://gerrit.wikimedia.org/r/113788
[15:32:45] (CR) Andrew Bogott: [C: 2] Fix some of the string-substitution magic in the labs vm build. [operations/puppet] - https://gerrit.wikimedia.org/r/113788 (owner: Andrew Bogott)
[15:33:02] what is @dc?
[15:34:46] or, well, $::dc
[15:35:25] mark: It's magic labs/ldap stuff
[15:35:35] ah ldap DC, right
[15:35:42] yeah
[15:36:04] we have two competing systems of DNS in labs… I'm not sure we really need them both, maybe some time I can stamp out the old Amazon-style ID stuff.
[15:36:22] (CR) coren: [C: 2] "Mark is okay with it after all. :-)" [operations/dns] - https://gerrit.wikimedia.org/r/113778 (owner: coren)
[15:37:35] andrewbogott: Do you need to create new images for that change to work?
[15:38:33] Yeah, I do. And I'm not sure it's actually the fix for 61413, but it's certainly a candidate.
[15:38:56] Oh, and plus it didn't work.
Dammit
[15:50:14] (CR) Nemo bis: Update ULS config (1 comment) [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/112435 (owner: Nikerabbit)
[15:56:26] Hm, did this patch make my job run every two minutes, or once per hour at 2 minutes past the hour? https://gerrit.wikimedia.org/r/#/c/113332/1/modules/puppetmaster/manifests/labs.pp
[15:57:18] andrewbogott: That should be once per hour. you want '*/2' for every two minutes.
[15:57:36] that explains… several things
[15:58:04] aww
[15:59:23] (PS1) Andrew Bogott: Update puppet every two minutes, not at two after the hour. [operations/puppet] - https://gerrit.wikimedia.org/r/113790
[15:59:28] Coren: ^ what I want?
[16:00:00] yes andrewbogott
[16:00:06] thx
[16:00:10] andrewbogott: Yep.
[16:00:15] but better add hour too *
[16:00:19] (CR) Andrew Bogott: [C: 2] Update puppet every two minutes, not at two after the hour. [operations/puppet] - https://gerrit.wikimedia.org/r/113790 (owner: Andrew Bogott)
[16:00:37] matanya: the cron resource type defaults to '*' for every field.
[16:00:38] in case of weird things that happen (and did happen in the past)
[16:01:09] it does, but doesn't hurt to be specific in some cases :)
[16:01:32] matanya: looks like I'm following the style of crons elsewhere, so I think I'm happy...
[16:01:36] or will be if it works :)
[16:01:43] good
[16:03:33] Coren: But Puppet is special about updating cron jobs: "An important note: the Cron type will not reset parameters that are removed from a manifest. For example, removing a minute => 10 parameter will not reset the minute component of the associated cronjob to *. These changes must be expressed by setting the parameter to minute => absent because Puppet only manages parameters that are out of sync with manifest entries."
[16:03:33] scfc_de`: Ah, good point.
[16:03:34] scfc_de: That's good to keep in mind when one is /changing/ an entry.
[16:04:53] and it is true to most types.
if you remove a file from a manifest it doesn't delete it from the server. you must have ensure=>absent before
[16:14:37] !log Jenkins added two labs slaves with 4 CPU: integration-slave02 and integration-slave03
[16:14:45] Logged the message, Master
[16:15:08] !log Jenkins deleting slave integration-slave01 (had only 2 CPU)
[16:15:16] Logged the message, Master
[17:06:12] PROBLEM - Puppet freshness on dysprosium is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 07:45:00 PM UTC
[18:53:54] (Abandoned) Odder: Raise account creation throttle for a SMA session [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113757 (owner: Odder)
[19:57:42] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1293.199951
[20:07:12] PROBLEM - Puppet freshness on dysprosium is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 07:45:00 PM UTC
[20:09:38] (PS1) coren: Labs: Add support for eqiad in /etc/nslcd.conf [operations/puppet] - https://gerrit.wikimedia.org/r/113802
[20:11:05] apergos: Got time to do a quick review?
[20:20:51] yes
[20:22:55] Coren: (blah can't tab complete your nick) tell me about ou-servicegroups please
[20:23:07] *ou=servicegroups
[20:25:35] beta cluster is 503'ing intermittently again
[20:25:52] apergos: All the labs service groups (in the form projectname.groupname) are stuffed in that OU. Old-style has them in the form (groupname) in a per-project OU. This allows the NFS server to know about ALL THE GROUPS!!!
[20:26:28] Our original intent was to isolate group names per-project. That turned out to be more trouble than any putative benefit in practice.
[20:35:10] I need to know how can i start contributing to wikimedia commons app ?
[20:35:20] which api's will be used ?
[20:36:26] wait so it's some huge ginormous comma separated string of groups or something?
[20:36:46] (sorry, I'm typing in another desktop so I'm a little slow to respond)
[20:36:48] Coren
[20:37:51] apergos: No, those are entries in the ou. In practice, those lines will expand to:
[20:37:57] base passwd ou=people,ou=servicegroups,dc=wikimedia,dc=org
[20:37:57] base shadow ou=people,ou=servicegroups,dc=wikimedia,dc=org
[20:37:57] base group ou=servicegroups,dc=wikimedia,dc=org
[20:38:16] ah yes in the ou, sorry
[20:38:41] ok got it
[20:44:39] Coren: can you elsif instead of an extra if...end?
[20:46:49] apergos: Syntactically, yes. Conceptually, I dunno if it's a good idea. It's "what to do in eqiad" / "what to do in pmtpa", that pmtpa's part is an if is a coincidence. I can change it if you think it makes it clearer though.
[20:47:42] I see, no that's ok
[20:48:12] lgtm
[20:48:39] (CR) ArielGlenn: [C: 1] Labs: Add support for eqiad in /etc/nslcd.conf [operations/puppet] - https://gerrit.wikimedia.org/r/113802 (owner: coren)
[20:49:11] (CR) coren: [C: 2] "Good enough for a push out the door." [operations/puppet] - https://gerrit.wikimedia.org/r/113802 (owner: coren)
[20:56:56] !log depooling cp3022.esams.wikimedia.org to investigate varnishkafka issues
[20:57:03] Logged the message, Master
[20:57:15] (PS1) coren: Labs: Allow '.'
(U+002E) in usernames from LDAP [operations/puppet] - https://gerrit.wikimedia.org/r/113874
[21:16:13] (CR) coren: [C: 2] "Trivial enough for self-merge (single character addition to regex)" [operations/puppet] - https://gerrit.wikimedia.org/r/113874 (owner: coren)
[21:19:42] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[21:25:42] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 156.5
[21:27:42] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[21:31:42] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 68.300003
[21:32:46] we know!
[21:36:33] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[21:40:42] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 18.9
[22:07:42] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[22:12:42] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 316.933319
[22:28:42] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[22:31:42] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 322.235291
[22:33:42] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[22:36:42] PROBLEM - Varnishkafka Delivery Errors on cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 134.399994
[22:37:42] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[22:40:42] PROBLEM - Varnishkafka Delivery Errors on
cp3022 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 233.366669
[22:43:43] (PS1) Jeremyb: redirect ukwikimedia to wikimedia.org.uk [operations/apache-config] - https://gerrit.wikimedia.org/r/113877
[22:45:47] (CR) Jeremyb: [C: -1] "-1 Pending UK approval of skipping the ukold part of the bug." [operations/apache-config] - https://gerrit.wikimedia.org/r/113877 (owner: Jeremyb)
[22:48:24] (PS2) Jeremyb: redirect ukwikimedia to wikimedia.org.uk [operations/apache-config] - https://gerrit.wikimedia.org/r/113877
[22:49:21] (CR) Jeremyb: [C: -1] redirect ukwikimedia to wikimedia.org.uk [operations/apache-config] - https://gerrit.wikimedia.org/r/113877 (owner: Jeremyb)
[22:49:42] RECOVERY - Varnishkafka Delivery Errors on cp3022 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0
[23:03:38] (PS1) Jeremyb: close ukwikimedia [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113878
[23:06:13] (CR) Jeremyb: "Is anything more needed to close or this is sufficient?" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/113878 (owner: Jeremyb)
[23:07:35] (CR) Jeremyb: "will also close wiki in Ib45270536bdf207a7edeec24082f08df3af2a60f" [operations/apache-config] - https://gerrit.wikimedia.org/r/113877 (owner: Jeremyb)
[23:08:12] PROBLEM - Puppet freshness on dysprosium is CRITICAL: Last successful Puppet run was Fri 14 Feb 2014 07:45:00 PM UTC
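
Editor's note on the cron exchange at 15:56-16:00: the mix-up was between a minute field of `2` (run once per hour, at two minutes past) and `*/2` (run every two minutes). The sketch below is only an illustration of that difference; it is not taken from the patch discussed in the log, the function name `expand_minute_field` is made up for this note, and it handles just a simplified subset of cron syntax (`*`, `*/N`, `N`, `A-B`, `A-B/N`, and comma-separated lists of those).

```python
def expand_minute_field(field: str) -> list[int]:
    """Expand a cron minute field into the sorted minutes (0-59) it matches.

    Hypothetical helper for illustration only; real cron implementations
    accept more syntax than this simplified parser does.
    """
    minutes: set[int] = set()
    for part in field.split(","):
        # Split an optional "/step" suffix off the range part.
        base, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if step < 1:
            raise ValueError(f"step must be positive: {part!r}")
        if base == "*":
            low, high = 0, 59
        elif "-" in base:
            low_text, high_text = base.split("-", 1)
            low, high = int(low_text), int(high_text)
        else:
            low = high = int(base)
        if not 0 <= low <= high <= 59:
            raise ValueError(f"minute out of range: {part!r}")
        minutes.update(range(low, high + 1, step))
    return sorted(minutes)


if __name__ == "__main__":
    # "2" matches a single minute per hour; "*/2" matches every even minute.
    print(expand_minute_field("2"))
    print(len(expand_minute_field("*/2")))
```

With the other fields left at `*`, the first form fires 24 times a day and the second 720 times a day, which is the gap between the behaviour described at 15:56 and the one intended.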