[00:02:31] PROBLEM - Disk space on labsdb1002 is CRITICAL: DISK CRITICAL - free space: /a 124499 MB (3% inode=99%):
[00:27:24] hoo: ah, ok
[00:28:31] PROBLEM - Disk space on labsdb1002 is CRITICAL: DISK CRITICAL - free space: /a 124888 MB (3% inode=99%):
[00:38:39] !log mail is stuck, lots of mails queued in exim
[00:38:44] Logged the message, Master
[00:39:28] springle: ping
[00:51:50] RECOVERY - Disk space on labsdb1002 is OK: DISK OK
[01:34:56] lfaraone: pong
[01:35:32] springle: do you know the right person to talk to about the above mail issue?
[01:36:38] lfaraone: probably paravoid. looking...
[01:37:51] Thanks for popping on springle
[01:44:12] sodium unhappy i guess
[01:44:14] Failed to create spool file /var/spool/exim4/input//1WytIR-0006Fw-K8-D: No space left on device
[01:57:42] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: Connection timed out
[01:58:42] RECOVERY - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 74185 bytes in 0.649 second response time
[02:02:01] Filesystem Inodes IUsed IFree IUse% Mounted on
[02:02:01] /dev/md0 977280 977280 0 100% /
[02:02:01] inodes
[02:02:11] ouch
[02:02:32] springle: Yikes... can you free up space in /tmp or /var/tmp maybe?
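Editor's note: the "No space left on device" error above came from inode exhaustion rather than a lack of free blocks, as the pasted `df -i` output shows (977280 of 977280 inodes used on /). The following is a minimal sketch of how that condition could be checked from Python on a Unix host. The function names `inode_percent` and `inode_usage` are hypothetical and are not part of any tooling mentioned in the channel.

```python
import os


def inode_percent(total, free):
    """Return the percentage of inodes in use, given total and free counts."""
    if total == 0:
        # Some filesystems report no inode totals at all; treat as 0% used.
        return 0.0
    return 100.0 * (total - free) / total


def inode_usage(path="/"):
    """Return (total, free, percent_used) for the filesystem holding `path`.

    Uses os.statvfs, which is available on Unix-like systems only.
    """
    st = os.statvfs(path)
    return st.f_files, st.f_ffree, inode_percent(st.f_files, st.f_ffree)


if __name__ == "__main__":
    total, free, pct = inode_usage("/")
    print(f"/ inodes: total={total} free={free} used={pct:.1f}%")
```

Fed the figures from the paste, `inode_percent(977280, 0)` evaluates to 100.0, which matches the IUse% column.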
[02:02:38] for some reason /var/spool/exim4/input is on /
[02:06:47] on a likely unrelated note some comments about database connection errors on enWiki (via #wikimedia-tech), not able to replicate yet
[02:07:27] ah
[02:07:35] Mon Jun 23 2:02:31 UTC 2014 mw1170 enwiki Error connecting to 10.64.32.19: :real_connect(): (08004/1040): Too many connections
[02:07:37] :/
[02:07:50] ah, there's one of us replicating
[02:08:22] looks good again
[02:09:54] that's an external storage server
[02:10:29] I don't have shell on that box, so I can't really tell much despite that it's fine again
[02:12:58] ganglia says there was a mysql spike right then which seems to have calmed down (was only up for a minute or two)
[02:13:43] given the timing I'd say: puppet run
[02:13:49] springle: --^
[02:14:15] same problems on the same server at 14:02 and 02:01/02
[02:14:56] !log LocalisationUpdate completed (1.24wmf9) at 2014-06-23 02:13:53+00:00
[02:15:04] Logged the message, Master
[02:15:25] * jamesofur will be back in a bit, we appear to be having a fire alarm
[02:26:41] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: No route to host
[02:26:56] !log LocalisationUpdate completed (1.24wmf10) at 2014-06-23 02:25:53+00:00
[02:27:01] Logged the message, Master
[02:27:41] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 74143 bytes in 0.587 second response time
[02:34:31] PROBLEM - HTTPS on sodium is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[02:35:02] PROBLEM - HTTP on sodium is CRITICAL: Connection refused
[02:35:16] me ^
[02:38:50] springle: Any opinion on the DB troubles mentioned above? When does puppet run on that box?
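Editor's note: the "Too many connections" line quoted at 02:07:35 names the database host that refused the connection. Below is a rough sketch of tallying such errors per host from a batch of log lines, to see which servers were affected. The regular expression is an assumption modelled on that single quoted line, and `count_too_many_connections` is a hypothetical helper, not an existing tool.

```python
import re
from collections import Counter

# Assumed pattern, based only on the one error line quoted in the log above.
ERROR_RE = re.compile(
    r"Error connecting to (?P<host>\d+\.\d+\.\d+\.\d+):.*Too many connections"
)


def count_too_many_connections(lines):
    """Count 'Too many connections' errors per database host IP."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            counts[match.group("host")] += 1
    return counts


sample = [
    "Mon Jun 23 2:02:31 UTC 2014 mw1170 enwiki Error connecting to "
    "10.64.32.19: :real_connect(): (08004/1040): Too many connections",
    "some unrelated log line",
]

if __name__ == "__main__":
    for host, n in count_too_many_connections(sample).most_common():
        print(host, n)
```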
[02:39:05] (PS1) Hoo man: Remove no longer needed base::access::dc-techs [operations/puppet] - https://gerrit.wikimedia.org/r/141373
[02:46:30] !log LocalisationUpdate ResourceLoader cache refresh completed at Mon Jun 23 02:45:24 UTC 2014 (duration 45m 23s)
[02:46:36] Logged the message, Master
[02:59:53] !log moving lighttpd compressed archives on sodium off / to regain inodes
[02:59:57] Logged the message, Master
[03:00:41] this is horrible
[03:00:41] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: No route to host
[03:01:01] but mchenry exim queue to lists should start to recover
[03:01:28] and spamd on sodium is back up; web ui still down
[03:01:41] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 74152 bytes in 0.607 second response time
[03:06:58] springle: thanks!
[03:07:25] +1 :)
[03:07:33] ... and good night
[03:08:11] RECOVERY - HTTP on sodium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 190 bytes in 6.253 second response time
[03:08:31] RECOVERY - HTTPS on sodium is OK: SSL_CERT OK - X.509 certificate for lists.wikimedia.org from RapidSSL CA valid until Jan 31 02:58:36 2016 GMT (expires in 587 days)
[03:11:07] Thanks springle I am indeed starting to see list posts ;)
[03:12:31] PROBLEM - HTTPS on sodium is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6
[03:13:01] PROBLEM - HTTP on sodium is CRITICAL: Connection refused
[03:13:08] ACKNOWLEDGEMENT - HTTP on sodium is CRITICAL: Connection refused Sean Pringle ask springle in #wikimedia-operations
[03:13:08] ACKNOWLEDGEMENT - HTTPS on sodium is CRITICAL: SSL_CERT CRITICAL: Error: verify depth is 6 Sean Pringle ask springle in #wikimedia-operations
[03:13:23] Jamesofur: ok great
[03:21:01] RECOVERY - HTTP on sodium is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 190 bytes in 0.001 second response time
[03:21:31] RECOVERY - HTTPS on sodium is OK: SSL_CERT OK - X.509 certificate for lists.wikimedia.org from RapidSSL CA
valid until Jan 31 02:58:36 2016 GMT (expires in 587 days)
[05:28:41] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: No route to host
[05:29:41] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 74144 bytes in 0.607 second response time
[05:38:41] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: No route to host
[05:39:41] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 74034 bytes in 0.598 second response time
[05:41:57] (CR) 01tonythomas: "Also, to add:-" [operations/puppet] - https://gerrit.wikimedia.org/r/141287 (owner: 01tonythomas)
[07:04:51] PROBLEM - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is CRITICAL: Connection timed out
[07:05:41] RECOVERY - LVS HTTPS IPv4 on text-lb.ulsfo.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 74071 bytes in 0.614 second response time
[07:10:43] <_joe_> wtf...
[07:18:41] PROBLEM - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is CRITICAL: No route to host
[07:19:41] RECOVERY - LVS HTTPS IPv6 on text-lb.ulsfo.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 74041 bytes in 0.612 second response time
[07:21:06] <_joe_> so, this is just the ipv6 link flapping, from what I can see
[07:21:29] <_joe_> !log powercycled cp4018, stuck with a blank console
[07:21:34] Logged the message, Master
[07:21:36] been doing that much of the day randomly
[07:22:02] <_joe_> Jamesofur: the flapping?
[07:22:31] aye, I know it happened around 2/2.5 hours ago and around 4/4.5 hours ago
[07:22:38] (just from my logs)
[07:23:06] <_joe_> it has to do with some route between the monitoring server and text-lb.ulsfo.wikimedia.org I'd say
[07:23:19] <_joe_> unluckily, I have no ipv6 connection here
[07:23:31] RECOVERY - Host cp4018 is UP: PING OK - Packet loss = 0%, RTA = 75.02 ms
[07:23:45] * Jamesofur nods
[07:23:56] <_joe_> (italy has no ipv6-enabled DSL offerings, and I haven't set up a tunnel yet)
[07:23:59] me either, occasionally I do... comcast seems to roll the dice every day or so
[07:24:59] <_joe_> also, without internal networking experience, I guess I could be of little help
[07:25:23] <_joe_> I just verified the LVS servers are healthy and that they do see a pooled backend
[07:25:31] PROBLEM - Varnishkafka log producer on cp4018 is CRITICAL: PROCS CRITICAL: 0 processes with command name varnishkafka
[07:25:43] <_joe_> that's about as much as I can do
[07:26:01] PROBLEM - Varnish HTTP text-backend on cp4018 is CRITICAL: Connection refused
[07:26:10] <_joe_> mmh
[07:28:31] RECOVERY - Varnishkafka log producer on cp4018 is OK: PROCS OK: 1 process with command name varnishkafka
[07:43:01] RECOVERY - Varnish HTTP text-backend on cp4018 is OK: HTTP OK: HTTP/1.1 200 OK - 188 bytes in 0.152 second response time
[07:57:47] (PS5) Nikerabbit: Enable ContentTranslation extension on beta labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140723
[07:59:43] (CR) Nikerabbit: "Now using the normal way to use the parsoid appropriate for the environment."
[operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140723 (owner: Nikerabbit)
[08:00:29] Now :)
[08:00:33] Thanks Nikerabbit
[08:13:21] PROBLEM - MySQL Processlist on db1021 is CRITICAL: CRIT 0 unauthenticated, 0 locked, 2 copy to table, 174 statistics
[08:14:21] RECOVERY - MySQL Processlist on db1021 is OK: OK 1 unauthenticated, 0 locked, 0 copy to table, 6 statistics
[08:22:21] (CR) Giuseppe Lavagetto: [C: -1] "I don't agree with the IfVersion approach. I'd create a template conf file with conditionals on apache version, so that running systems ha" [operations/apache-config] - https://gerrit.wikimedia.org/r/141062 (owner: Ori.livneh)
[08:33:35] (CR) Giuseppe Lavagetto: [C: -1] "As said for the other patch: I'd avoid mod_version and use template conditionals instead." [operations/puppet] - https://gerrit.wikimedia.org/r/141059 (owner: Ori.livneh)
[08:34:56] (CR) Giuseppe Lavagetto: [C: 1] [HAT] text-frontend VCL: set Content-Type if not set [operations/puppet] - https://gerrit.wikimedia.org/r/141086 (owner: Ori.livneh)
[08:35:37] <_joe_> running an errand, BBL
[09:40:06] !log killing sodium's lighttpd compress cache
[09:40:12] Logged the message, Master
[09:48:06] (PS1) Faidon Liambotis: mailman: block ArchiveTeam ArchiveBot UA [operations/puppet] - https://gerrit.wikimedia.org/r/141386
[09:48:34] (CR) Faidon Liambotis: [C: 2 V: 2] mailman: block ArchiveTeam ArchiveBot UA [operations/puppet] - https://gerrit.wikimedia.org/r/141386 (owner: Faidon Liambotis)
[10:05:33] !log hardreset maerlant, stuck on console and no ssh
[10:05:38] Logged the message, Master
[10:05:53] for the curious, it has had climbing load for the last month http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20esams&h=maerlant.esams.wikimedia.org&m=cpu_report&r=custom&s=by%20name&hc=4&mc=2&cs=01%2F01%2F2014%2000%3A00%20&ce=06%2F23%2F2014%2000%3A00
[10:08:21] PROBLEM - MySQL Processlist on db1021 is CRITICAL: CRIT 0
unauthenticated, 0 locked, 0 copy to table, 593 statistics
[10:10:21] RECOVERY - MySQL Processlist on db1021 is OK: OK 0 unauthenticated, 0 locked, 0 copy to table, 6 statistics
[10:23:58] (PS1) Filippo Giunchedi: repurpose eqiad bits appservers as general appservers [operations/puppet] - https://gerrit.wikimedia.org/r/141388
[10:25:38] bd808|BUFFER greg-g I've put you up on https://gerrit.wikimedia.org/r/141388 to dissolve bits appservers, in case something in the deployment depends on the specific host group being there
[10:28:49] godog: lol you just halved the cluster's load by fixing maerlant
[10:31:16] <_joe_> Nemo_bis: ?
[10:32:06] <_joe_> maerlant is decommissioned right?
[10:32:48] https://ganglia.wikimedia.org/latest/graph.php?c=Miscellaneous%20esams&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&st=1403519484&g=load_report&z=medium&r=hour > https://ganglia.wikimedia.org/latest/graph.php?me=Wikimedia&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2&g=load_report&z=medium
[10:32:57] (CR) QChris: "It seems we want to keep the save and" [operations/puppet] - https://gerrit.wikimedia.org/r/141120 (https://bugzilla.wikimedia.org/66911) (owner: QChris)
[10:33:04] (CR) QChris: [C: -1] Have Wikimetrics use the redis module's configuration again [operations/puppet] - https://gerrit.wikimedia.org/r/141120 (https://bugzilla.wikimedia.org/66911) (owner: QChris)
[10:33:41] Nemo_bis: yeah I was looking yesterday at the YoY cluster load and it was spiking up.. bah!
[10:34:32] ah indeed _joe_ is right it is decom
[10:34:34] ^^
[10:35:05] now we only need to get rid of the 10 PB/s network spikes :P
[10:35:25] wait that's not real?!
[10:35:55] sure not :)
[10:36:41] just a bit ugly on the graphs https://ganglia.wikimedia.org/latest/graph.php?c=Swift%20esams&m=cpu_report&r=week&s=by%20name&hc=4&mc=2&st=1403519716&g=network_report&z=medium&r=week
[10:37:21] !log powering down maerlant, decom-med
[10:37:26] Logged the message, Master
[10:48:41] (CR) Hashar: "A few comments inline related to beta." (4 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/137803 (owner: Andrew Bogott)
[10:50:56] (CR) Amire80: [C: 1] Meta: automatic translation workflow state changes [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/137804 (owner: Awight)
[10:55:52] (PS1) Hashar: beta: drop now unneed ssh ferm rule [operations/puppet] - https://gerrit.wikimedia.org/r/141393
[10:57:38] (CR) Hashar: [C: -1] "will rebase" (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/140891 (owner: KartikMistry)
[11:03:54] (PS3) Hashar: beta: remove duplicate ferm::rule [operations/puppet] - https://gerrit.wikimedia.org/r/140891 (owner: KartikMistry)
[11:05:52] hashar: thanks!
[11:06:59] (CR) Hashar: "Rebased. One of the rule was unneeded and I dropped it with https://gerrit.wikimedia.org/r/#/c/140891/" [operations/puppet] - https://gerrit.wikimedia.org/r/141393 (owner: Hashar)
[11:07:31] (CR) Hashar: "Sorry I wanted to comment on follow up https://gerrit.wikimedia.org/r/#/c/140891/" [operations/puppet] - https://gerrit.wikimedia.org/r/141393 (owner: Hashar)
[11:07:43] (CR) Hashar: "Rebased.
One of the rule was unneeded and I dropped it with https://gerrit.wikimedia.org/r/#/c/140891/" [operations/puppet] - https://gerrit.wikimedia.org/r/140891 (owner: KartikMistry)
[11:09:24] (CR) Hashar: [C: 1] beta: drop now unneed ssh ferm rule [operations/puppet] - https://gerrit.wikimedia.org/r/141393 (owner: Hashar)
[11:09:30] (CR) Hashar: [C: 1] beta: remove duplicate ferm::rule [operations/puppet] - https://gerrit.wikimedia.org/r/140891 (owner: KartikMistry)
[11:33:03] (CR) QChris: [C: -1] "Will provide patch set that also allows to bring over" [operations/puppet] - https://gerrit.wikimedia.org/r/141116 (https://bugzilla.wikimedia.org/66911) (owner: QChris)
[11:44:07] (PS1) Faidon Liambotis: realm: remove $all_prefixes [operations/puppet] - https://gerrit.wikimedia.org/r/141399
[11:44:09] (PS1) Faidon Liambotis: exim: remove otrs_mail_from_hosts [operations/puppet] - https://gerrit.wikimedia.org/r/141400
[11:44:11] (PS1) Faidon Liambotis: exim: minor exim.conf cleanup, mostly comments [operations/puppet] - https://gerrit.wikimedia.org/r/141401
[11:44:13] (PS1) Faidon Liambotis: mail: remove wikimedia.cz from own domains [operations/puppet] - https://gerrit.wikimedia.org/r/141402
[11:44:15] (PS1) Faidon Liambotis: mail: add WLM domains as own domains [operations/puppet] - https://gerrit.wikimedia.org/r/141403
[11:44:17] (PS1) Faidon Liambotis: realm: rename exim_default_route_list & exim_mediawiki_route_list [operations/puppet] - https://gerrit.wikimedia.org/r/141404
[11:44:19] (PS1) Faidon Liambotis: mail: remove hardcoded mchenry/lists values [operations/puppet] - https://gerrit.wikimedia.org/r/141405
[11:44:21] (PS1) Faidon Liambotis: mail: add polonium to smarthosts as primary, remove lists [operations/puppet] - https://gerrit.wikimedia.org/r/141406
[11:44:26] :)
[11:44:35] * paravoid waits for jenkins
[11:45:04] (PS2) QChris: Parametrize redis' settings needed for
Wikimetrics [operations/puppet] - https://gerrit.wikimedia.org/r/141116 (https://bugzilla.wikimedia.org/66911)
[11:47:22] (CR) jenkins-bot: [V: -1] mail: remove hardcoded mchenry/lists values [operations/puppet] - https://gerrit.wikimedia.org/r/141405 (owner: Faidon Liambotis)
[11:47:49] (CR) jenkins-bot: [V: -1] mail: add polonium to smarthosts as primary, remove lists [operations/puppet] - https://gerrit.wikimedia.org/r/141406 (owner: Faidon Liambotis)
[11:47:53] blergh
[11:48:56] (PS2) Faidon Liambotis: mail: remove hardcoded mchenry/lists values [operations/puppet] - https://gerrit.wikimedia.org/r/141405
[11:48:58] (PS2) Faidon Liambotis: mail: add polonium to smarthosts as primary, remove lists [operations/puppet] - https://gerrit.wikimedia.org/r/141406
[11:53:56] ok, lunch, then merging these :)
[12:00:15] (PS2) QChris: Have Wikimetrics use the redis module's configuration again [operations/puppet] - https://gerrit.wikimedia.org/r/141120 (https://bugzilla.wikimedia.org/66911)
[12:01:22] (PS3) QChris: Have Wikimetrics use the redis module's configuration again [operations/puppet] - https://gerrit.wikimedia.org/r/141120 (https://bugzilla.wikimedia.org/66911)
[12:02:46] (CR) jenkins-bot: [V: -1] Have Wikimetrics use the redis module's configuration again [operations/puppet] - https://gerrit.wikimedia.org/r/141120 (https://bugzilla.wikimedia.org/66911) (owner: QChris)
[12:17:30] !log rebuilding Cirrus index on group0 wikis to pick up changes like results boosting from categories and wikitext search
[12:17:35] Logged the message, Master
[12:22:34] (CR) Faidon Liambotis: [C: 2] realm: remove $all_prefixes [operations/puppet] - https://gerrit.wikimedia.org/r/141399 (owner: Faidon Liambotis)
[12:23:46] (CR) Faidon Liambotis: [C: 2] exim: remove otrs_mail_from_hosts [operations/puppet] - https://gerrit.wikimedia.org/r/141400 (owner: Faidon Liambotis)
[12:25:01] (CR) Faidon Liambotis: [C: 
2] exim: minor exim.conf cleanup, mostly comments [operations/puppet] - https://gerrit.wikimedia.org/r/141401 (owner: Faidon Liambotis)
[12:25:14] (CR) Faidon Liambotis: [C: 2] mail: remove wikimedia.cz from own domains [operations/puppet] - https://gerrit.wikimedia.org/r/141402 (owner: Faidon Liambotis)
[12:25:35] (CR) Faidon Liambotis: [C: 2] mail: add WLM domains as own domains [operations/puppet] - https://gerrit.wikimedia.org/r/141403 (owner: Faidon Liambotis)
[12:33:03] (PS3) Faidon Liambotis: mail: remove hardcoded mchenry/lists values [operations/puppet] - https://gerrit.wikimedia.org/r/141405
[12:33:05] (PS2) Faidon Liambotis: realm: rename exim_default_route_list & exim_mediawiki_route_list [operations/puppet] - https://gerrit.wikimedia.org/r/141404
[12:33:07] (PS3) Faidon Liambotis: mail: add polonium to smarthosts as primary, remove lists [operations/puppet] - https://gerrit.wikimedia.org/r/141406
[12:40:16] (CR) Faidon Liambotis: [C: 2] "Tested with comparator." [operations/puppet] - https://gerrit.wikimedia.org/r/141404 (owner: Faidon Liambotis)
[12:40:42] (CR) Faidon Liambotis: [C: 2] "Tested with comparator."
[operations/puppet] - https://gerrit.wikimedia.org/r/141405 (owner: Faidon Liambotis)
[12:43:39] (PS4) Faidon Liambotis: mail: add polonium to smarthosts as primary, remove lists [operations/puppet] - https://gerrit.wikimedia.org/r/141406
[12:52:40] (PS1) Giuseppe Lavagetto: puppet3: switch everything else to puppet 3 [operations/puppet] - https://gerrit.wikimedia.org/r/141412
[12:56:17] (CR) Giuseppe Lavagetto: [C: 2] puppet3: switch everything else to puppet 3 [operations/puppet] - https://gerrit.wikimedia.org/r/141412 (owner: Giuseppe Lavagetto)
[12:58:56] (PS5) Faidon Liambotis: mail: add polonium to smarthosts as primary, remove lists [operations/puppet] - https://gerrit.wikimedia.org/r/141406
[12:58:58] (PS1) Faidon Liambotis: mail: add spamassassin to role::mail::mx [operations/puppet] - https://gerrit.wikimedia.org/r/141413
[12:59:16] (PS2) Faidon Liambotis: mail: add spamassassin to role::mail::mx [operations/puppet] - https://gerrit.wikimedia.org/r/141413
[12:59:18] (PS6) Faidon Liambotis: mail: add polonium to smarthosts as primary, remove lists [operations/puppet] - https://gerrit.wikimedia.org/r/141406
[12:59:34] (CR) Faidon Liambotis: [C: 2] mail: add spamassassin to role::mail::mx [operations/puppet] - https://gerrit.wikimedia.org/r/141413 (owner: Faidon Liambotis)
[13:06:45] (PS7) Faidon Liambotis: mail: add polonium to smarthosts as primary, remove lists [operations/puppet] - https://gerrit.wikimedia.org/r/141406
[13:06:47] (PS1) Faidon Liambotis: mail: add network::constants include to role class [operations/puppet] - https://gerrit.wikimedia.org/r/141415
[13:07:01] PROBLEM - DPKG on tungsten is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[13:07:09] <_joe_> mmmh
[13:09:03] <_joe_> it's just puppet having a hard time installing the package as tungsten is loaded
[13:09:45] <_joe_> but it should be ok now
[13:10:01] RECOVERY - DPKG on tungsten is OK: All packages OK
[13:13:12] (CR) Faidon Liambotis: [C: 2] mail: add network::constants include to role class [operations/puppet] - https://gerrit.wikimedia.org/r/141415 (owner: Faidon Liambotis)
[13:19:36] (CR) Faidon Liambotis: [C: 2] mail: add polonium to smarthosts as primary, remove lists [operations/puppet] - https://gerrit.wikimedia.org/r/141406 (owner: Faidon Liambotis)
[13:19:59] !log switching outbound email to polonium
[13:20:04] Logged the message, Master
[13:20:15] wohoo
[13:20:17] mchenry is a pretty old box
[13:20:22] and sanger, equal
[13:20:30] I kept mchenry as secondary for now
[13:20:31] 2006 I think
[13:20:39] I'd like to remove mail relaying from lists
[13:20:40] never had any issues
[13:20:48] so I may either reformat mchenry with trusty
[13:20:58] or get another eqiad box, until we get codfw
[13:21:02] the latter
[13:21:10] trusdtyUbuntu 7.04 auto-installed on Sun Apr 22 17:49:11 UTC 2007.
[13:21:12] trusty is pretty funny for that box :)
[13:22:36] people always talk about fenari being old, but fenari is pretty new
[13:22:41] we installed it in 2009 I think
[13:23:18] zwinger before it was much scarier
[13:27:18] (CR) Faidon Liambotis: [C: -1] "There's at least dsh groups as well. And don't forget pybal, which is not in puppet."
[operations/puppet] - https://gerrit.wikimedia.org/r/141388 (owner: Filippo Giunchedi)
[13:27:53] (PS2) Faidon Liambotis: swift: don't automatically restart backend services [operations/puppet] - https://gerrit.wikimedia.org/r/140922 (owner: Filippo Giunchedi)
[13:28:48] (CR) Faidon Liambotis: [C: -1] "(I updated the commit message)" [operations/puppet] - https://gerrit.wikimedia.org/r/140922 (owner: Filippo Giunchedi)
[13:41:03] (PS1) Faidon Liambotis: exim: drop wiki-mail from WIKI_INTERFACE [operations/puppet] - https://gerrit.wikimedia.org/r/141419
[13:41:55] (PS1) Faidon Liambotis: mail: add system::role to role::mail::mx [operations/puppet] - https://gerrit.wikimedia.org/r/141420
[13:42:10] (CR) Faidon Liambotis: [C: 2 V: 2] exim: drop wiki-mail from WIKI_INTERFACE [operations/puppet] - https://gerrit.wikimedia.org/r/141419 (owner: Faidon Liambotis)
[13:42:26] (CR) Faidon Liambotis: [C: 2 V: 2] mail: add system::role to role::mail::mx [operations/puppet] - https://gerrit.wikimedia.org/r/141420 (owner: Faidon Liambotis)
[13:46:56] (PS2) Filippo Giunchedi: repurpose eqiad bits appservers as general appservers [operations/puppet] - https://gerrit.wikimedia.org/r/141388
[13:59:01] (CR) Filippo Giunchedi: "removed the empty file files/dsh/group/apaches-bits, plan is to merge back the machines in pybal as soon as this is merged" [operations/puppet] - https://gerrit.wikimedia.org/r/141388 (owner: Filippo Giunchedi)
[14:02:05] (Cannot contact the database server: Too many connections (10.64.16.34))
[14:02:22] same thing!!
[14:02:25] Cannot contact the database server: Too many connections (10.64.16.10))
[14:02:50] works again
[14:07:52] <_joe_> aude: it was a temporary hiccup I'd say
[14:08:02] <_joe_> we need to understand why
[14:08:38] * aude nods
[14:11:06] both of these servers are s5
[14:14:15] !log upgraded Zuul by one commit (that introduces swift supports though disabled it on our setup via a custom hack)
[14:14:20] Logged the message, Master
[14:22:38] (CR) Alexandros Kosiaris: "Answering inline" (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/139095 (owner: Nikerabbit)
[14:26:19] (PS3) Filippo Giunchedi: swift: don't automatically restart backend services [operations/puppet] - https://gerrit.wikimedia.org/r/140922
[14:27:34] (CR) jenkins-bot: [V: -1] swift: don't automatically restart backend services [operations/puppet] - https://gerrit.wikimedia.org/r/140922 (owner: Filippo Giunchedi)
[14:27:44] (CR) Filippo Giunchedi: "ah yeah, I've cleaned that up and merged all the services" [operations/puppet] - https://gerrit.wikimedia.org/r/140922 (owner: Filippo Giunchedi)
[14:31:05] (PS4) Filippo Giunchedi: swift: don't automatically restart backend services [operations/puppet] - https://gerrit.wikimedia.org/r/140922
[14:31:10] and now with the right syntax
[14:36:04] (PS1) Faidon Liambotis: Drop wikimedia.cz; upstream NS pointed elsewhere [operations/dns] - https://gerrit.wikimedia.org/r/141423
[14:36:06] (PS1) Faidon Liambotis: Remove ip4:208.80.152.164 from donate's SPF [operations/dns] - https://gerrit.wikimedia.org/r/141424
[14:36:08] (PS1) Faidon Liambotis: Remove smtp.pmtpa.wmnet service alias [operations/dns] - https://gerrit.wikimedia.org/r/141425
[14:36:10] (PS1) Faidon Liambotis: MX switch, part 1 [operations/dns] - https://gerrit.wikimedia.org/r/141426
[14:36:12] (PS1) Faidon Liambotis: MX switch, part 2 [operations/dns] - https://gerrit.wikimedia.org/r/141427
[14:36:34] Jeff_Green: 
^ :)
[14:36:38] (CR) Alexandros Kosiaris: [C: -1] "This is shaping along nicely. Still needs modularization and IMHO dropping the $DEFAULTFILE approach. Modularization will also make it pos" (3 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/139095 (owner: Nikerabbit)
[14:37:11] also mark ^^ I suppose
[14:37:51] (CR) Faidon Liambotis: [C: 2] "Fairly obvious." [operations/dns] - https://gerrit.wikimedia.org/r/141423 (owner: Faidon Liambotis)
[14:38:20] paravoid: oh fun
[14:38:32] tilting at mailservers again
[14:38:33] Jeff_Green: I switched outbound earlier today too
[14:38:48] I also have some patches that I haven't cleaned up/pushed yet that move otrs/rt into separate configs
[14:38:52] ok
[14:38:53] and oh boy, it's awesome
[14:38:54] !log Further upgraded Zuul up to upstream b8c24ce + our local hacks. Git tag is wmf-deploy-20140623-4
[14:38:58] you can actually understand otrs' config
[14:38:59] Logged the message, Master
[14:39:22] you mean you can actually understand the exim config?
[14:39:26] yes!
[14:39:33] exim being the operative word there :-P
[14:39:37] yeah, sorry :P
[14:39:40] ha
[14:40:33] before I start reviewing stuff--we need to allow frack boxes access to any new mtas
[14:40:42] oh nice
[14:40:45] I forgot about that
[14:40:50] there's a global config to flip in frack puppet
[14:40:59] but I was cautious enough to keep mchenry as the backup
[14:41:02] cool
[14:41:07] so it's mchenry/sodium -> polonium/mchenry
[14:41:15] ok
[14:41:16] so it will fall back to mchenry and work I guess
[14:41:27] they won't even try polonium yet
[14:41:29] I plan to empty out mchenry from inbound as well
[14:41:34] ok great
[14:41:35] then inspect mchenry's logs
[14:41:48] so I would have caught that eventually I guess, but awesome that you thought about it now :)
[14:42:31] I have a fairly large commit I want to blast out there first, which I've been sitting on since friday, then I'll look at the frack puppet conf
[14:42:47] I'll add a ticket for the hw firewall access
[14:43:11] cool
[14:43:18] I also responded to your other firewall ticket btw
[14:43:21] i saw
[14:43:22] regarding frimpressions
[14:43:26] right
[14:43:28] in the meantime, I guess "Remove ip4:208.80.152.164 from donate's SPF" is okay?
[14:43:52] i meant it in the sense of "misc db's" that people throw around, but I guess that's now split m1 vs m2
[14:44:08] yup
[14:44:29] what's 208.80.152.164 ? there's no rdns
[14:44:42] aluminium's predecessor in Tampa, I'm guessing
[14:44:53] oh! grosley. yes, can remove
[14:45:04] (CR) Faidon Liambotis: [C: 2] Remove ip4:208.80.152.164 from donate's SPF [operations/dns] - https://gerrit.wikimedia.org/r/141424 (owner: Faidon Liambotis)
[14:47:24] andrewbogott: ping
[14:47:33] labstore1001:
[14:47:33] The last Puppet run was at Fri Jun 13 18:17:15 UTC 2014 (14190 minutes ago).
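Editor's note: the "14190 minutes ago" figure in the labstore1001 message can be reproduced with plain datetime arithmetic. This is a small sketch that assumes the channel timestamp 14:47:33 on 2014-06-23 is UTC and is the moment the staleness was computed; `minutes_since` is a hypothetical helper name.

```python
from datetime import datetime


def minutes_since(last_run, now):
    """Whole minutes elapsed between two naive UTC datetimes."""
    return int((now - last_run).total_seconds() // 60)


# Last Puppet run as quoted in the message.
last_run = datetime(2014, 6, 13, 18, 17, 15)
# Assumed report time: the channel timestamp of the message, taken as UTC.
reported_at = datetime(2014, 6, 23, 14, 47, 33)

if __name__ == "__main__":
    print(minutes_since(last_run, reported_at))  # 14190
```

That is 9 days, 20 hours and 30 minutes, or just under ten days without a Puppet run.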
[14:48:10] paravoid: as far as I know that's still suspected due to the failed attempt to upgrade to 10g
[14:48:24] I don't know how that relates to puppet being disabled though
[14:48:37] (CR) Alexandros Kosiaris: [C: 2] beta: remove duplicate ferm::rule [operations/puppet] - https://gerrit.wikimedia.org/r/140891 (owner: KartikMistry)
[14:48:54] hashar: merging this ^
[14:49:21] paravoid: Coren will be back in a couple of days, so if the problem will keep until Wednesday I'd prefer to leave it in his court
[14:49:40] manybubbles: I'll SWAT today
[14:49:44] ok, although we've said numerous times that puppet should not be disabled for more than a few hours at most
[14:49:48] noahsussman: ok!
[14:49:59] I was about to break the box's email, for example
[14:50:05] paravoid: I agree.
[14:50:06] twkozlowski: Ping, SWAT in 10 minutes
[14:50:12] akosiaris: hopefully that will be fine. There is a dependent change https://gerrit.wikimedia.org/r/#/c/141393/
[14:50:41] paravoid: It is /probably/ safe to reenable, but I know that I asked Coren if I could reenable a week ago and he asked me to hold off
[14:50:59] (CR) Alexandros Kosiaris: [C: 2] beta: drop now unneed ssh ferm rule [operations/puppet] - https://gerrit.wikimedia.org/r/141393 (owner: Hashar)
[14:51:01] paravoid: he may need your support to untangle the network issues.
[14:51:13] I can't see how they're related tbh
[14:51:21] puppet won't
[14:51:26] puppet won't mess with running interfaces
[14:51:38] yeah, me neither
[14:52:28] (PS1) Faidon Liambotis: Switch tridge from base to standard [operations/puppet] - https://gerrit.wikimedia.org/r/141430
[14:52:50] akosiaris: see any reason why not?
[14:52:58] or apergos? :)
[14:53:20] not on my part
[14:53:56] paravoid: I will reenable and keep an eye on it.
[14:55:05] awesome :)
[14:55:14] (CR) Faidon Liambotis: [C: 2] Switch tridge from base to standard [operations/puppet] - https://gerrit.wikimedia.org/r/141430 (owner: Faidon Liambotis)
[14:55:14] !log reenabling puppet on labstore1001, hoping it doesn't break labs
[14:55:18] Logged the message, Master
[14:55:37] RECOVERY - Puppet freshness on labstore1001 is OK: puppet ran at Mon Jun 23 14:55:29 UTC 2014
[14:57:04] paravoid: https://dpaste.de/yp4C
[14:57:05] eek
[14:57:10] (PS19) KartikMistry: cxserver configuration for beta labs [operations/puppet] - https://gerrit.wikimedia.org/r/139095 (owner: Nikerabbit)
[14:57:13] anomie: Pong, I'm here & ready :-)
[14:58:15] bblack: ping?
[14:58:38] (CR) Faidon Liambotis: [C: 2] Remove smtp.pmtpa.wmnet service alias [operations/dns] - https://gerrit.wikimedia.org/r/141425 (owner: Faidon Liambotis)
[15:00:03] paravoid: Ah, there's a little bit more: https://dpaste.de/oYsd
[15:00:04] manybubbles, anomie: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140623T1500)
[15:00:11] * anomie starts SWAT
[15:00:19] Does that look breaky to you? Things seem to still be working fine after that applied...
[15:00:22] twkozlowski: I'm going to do the bugfixes first, then your config changes
[15:00:32] eh?
[15:00:45] * apergos does the backread
[15:01:01] (CR) BryanDavis: "LGTM. I can't think of any specific way that this would affect scap. The mediawiki-installation dsh group is the main one used by scap and" [operations/puppet] - https://gerrit.wikimedia.org/r/141388 (owner: Filippo Giunchedi)
[15:01:12] andrewbogott: you can just remove the interface::aggregate { 'bond0':
[15:01:15] * twkozlowski acknowledges receipt of anomie's message
[15:01:16] from the puppet config
[15:01:17] for now
[15:01:59] apergos: can you fix the check_disk alert so that it doesn't fire off on dataset1001 when you get some time?
[15:01:59] Hm, also just got +for INTERFACE in bond0 eth0 eth1 lo ; do [15:02:00] REQ_SPEED=1000 # The default for now [15:02:01] STATUS=`ip link show ${INTERFACE}` [15:02:02] if [ "$?" != "0" ]; then [15:02:05] PROBLEM - check configured eth on labstore1001 is CRITICAL: bond0 reporting no carrier. [15:02:11] ah yeah, thanks for reminding [15:02:13] and there it is... [15:02:18] :) [15:02:23] I'm just looking at icinga alerts [15:02:28] it's all a mess again :) [15:02:29] (03CR) 10Hashar: "Cherry picked on beta cluster puppet master." [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [15:03:39] HTTP CRITICAL: HTTP/1.1 500 MediaWiki exception - 1764 bytes in 0.260 second response time [15:03:42] text-lb-eqiad [15:04:17] pybal.log:266 [15:04:17] pybal.log.1.gz:532 [15:04:17] pybal.log.2.gz:177 [15:04:17] pybal.log.3.gz:0 [15:04:28] (lvs1001's pybal) [15:04:39] !log anomie Synchronized php-1.24wmf10/includes/api/ApiExpandTemplates.php: SWAT: Fix fatal in API action=expandtemplates with Scribunto [[gerrit:141417]] (duration: 00m 14s) [15:04:44] Logged the message, Master [15:04:45] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data exceeded the critical threshold [500.0] [15:04:46] * anomie confirms fix [15:05:19] Yikes. That was a heck of a spike in "Too many connections" db errors a couple of minutes ago. seems to have died down now. 10.64.16.10, 10.64.16.26, 10.64.16.34, 10.64.16.154 all took part [15:05:39] (03CR) 10Hashar: [C: 04-1] cxserver configuration for beta labs (031 comment) [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [15:06:08] the first three are s5, the last one is es1006 [15:06:24] (03PS1) 10Andrew Bogott: Marked out a networking WIP on labstore1001 [operations/puppet] - 10https://gerrit.wikimedia.org/r/141433 [15:06:33] paravoid: ^ ? 
[15:07:11] (03CR) 10Faidon Liambotis: [C: 032] Marked out a networking WIP on labstore1001 [operations/puppet] - 10https://gerrit.wikimedia.org/r/141433 (owner: 10Andrew Bogott) [15:07:40] chasemp: ping [15:07:49] chasemp: oops I was going to ping you too [15:08:04] python-diamond? [15:08:51] not for me -- the admins.pp refactor is going poorly on labstore1001 [15:08:59] since that box also has a million ldap user accounts [15:09:03] Logstash shows 25,613 dberror messages from 15:01:37 to 15:02:49 [15:09:28] bd808: also happened earlier today [15:09:33] 14:00 UTC approximately [15:09:54] and also at approx. 2am UTC we had another spike, probably related as well [15:10:06] !log anomie Synchronized php-1.24wmf9/includes/api/ApiExpandTemplates.php: SWAT: Fix fatal in API action=expandtemplates with Scribunto [[gerrit:141416]] (duration: 00m 15s) [15:10:11] Logged the message, Master [15:10:12] * anomie verifies [15:10:21] twkozlowski: Ok, doing the ruwiki patch now [15:10:31] (03PS3) 10Anomie: Change some user group rights on ruwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140910 (https://bugzilla.wikimedia.org/66871) (owner: 10Odder) [15:10:37] (03CR) 10Anomie: [C: 032] "SWAT" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140910 (https://bugzilla.wikimedia.org/66871) (owner: 10Odder) [15:10:45] (03Merged) 10jenkins-bot: Change some user group rights on ruwiki [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/140910 (https://bugzilla.wikimedia.org/66871) (owner: 10Odder) [15:11:22] !log anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: Adjust group rights on ruwiki [[gerrit:140910]] (duration: 00m 14s) [15:11:23] twkozlowski: ^ check please [15:11:25] Logged the message, Master [15:12:34] anomie: Works [15:12:50] (03PS2) 10Anomie: Add a Library of Congress domain to whitelist [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141308 (https://bugzilla.wikimedia.org/66945) (owner: 
10Odder) [15:12:57] (03CR) 10Anomie: [C: 032] "SWAT" [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141308 (https://bugzilla.wikimedia.org/66945) (owner: 10Odder) [15:13:10] (03Merged) 10jenkins-bot: Add a Library of Congress domain to whitelist [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141308 (https://bugzilla.wikimedia.org/66945) (owner: 10Odder) [15:13:34] !log anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: Add a Library of Congress domain to wgCopyUploadsDomains [[gerrit:141308]] (duration: 00m 14s) [15:13:35] twkozlowski: ^ check please [15:13:39] Logged the message, Master [15:13:43] How do I check this, anomie? :-) [15:13:53] twkozlowski: I don't know, you're the one who submitted it [15:14:19] Well, the merge worked, so it's in the whitelist [15:14:34] * anomie is done with SWAT then [15:16:26] hi [15:17:15] (03CR) 10Faidon Liambotis: [C: 032] "Works for me, but I think it needs a manual rebase (I see $puppet_version = 3 in the code)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141388 (owner: 10Filippo Giunchedi) [15:17:37] (03CR) 10Faidon Liambotis: [C: 032] swift: don't automatically restart backend services [operations/puppet] - 10https://gerrit.wikimedia.org/r/140922 (owner: 10Filippo Giunchedi) [15:17:45] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [15:17:50] Thanks, anomie [15:18:00] paravoid: 30 day logstash view shows that the rate of connection exhaustion is trending up since 2014-06-12/2014-06-13; It buckets by 12h period in the 30 day timeline. [15:18:02] andrewbogott: why ping 4? :) [15:18:04] twkozlowski: No problem [15:18:30] chasemp: Once you're done with paravoid, log into labstore1001 and have a look at what puppet is doing there? [15:18:54] <_joe_> andrewbogott: may I help? 
[15:19:00] um… s/paravoid/paravoid and your breakfast/ [15:19:14] chasemp: python-diamond doesn't exist in lucid, but puppet tries to install it [15:19:21] _joe_: Maybe, but it's the admins refactor. So I figured chase would be interested. [15:20:12] bd808: 06-12 was a deployment window [15:20:16] paravoid: andrewbogott looking now...that is weird idk what is up yet [15:20:48] paravoid: ah yes, I'm sorry with the trusty packaging the version pinning was removed, I didn't take into account older things hanging about [15:21:01] I will fix to specifiy precise & trusty only? [15:21:08] or build for lucid as well? [15:21:14] <_joe_> chasemp: seems sensible [15:21:26] chasemp: remember that labstore1001 is crazy, user-account-wise. Because every labs ldap user has an account there as well. [15:21:29] To manage NFS privs [15:22:11] (03PS3) 10Filippo Giunchedi: repurpose eqiad bits appservers as general appservers [operations/puppet] - 10https://gerrit.wikimedia.org/r/141388 [15:22:48] paravoid: Hmmm.. possibly something introduced in 1.24wmf8 that hit the big wikis on that day [15:23:46] paravoid: your call, is lucid meant to be around for awhile? [15:23:53] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] repurpose eqiad bits appservers as general appservers [operations/puppet] - 10https://gerrit.wikimedia.org/r/141388 (owner: 10Filippo Giunchedi) [15:24:15] chasemp: not many boxes left, but if it's easy enough to backport, let's do it I'd say [15:24:21] chasemp: considering how long it took us to get rid of 8.04... [15:24:30] (03Abandoned) 10Nuria: Removing hostname from redis backup file [operations/puppet] - 10https://gerrit.wikimedia.org/r/140394 (owner: 10Nuria) [15:24:38] <_joe_> chasemp: from what I see, your admin class does not work well with ldap-defined users [15:25:32] _joe_: yes the overlap of accounts being in both puppet and ldap for the same box is problematic. 
I think saying one of the other is authoritative on a host is going to be necessary [15:25:39] or some version of this will continually happen [15:26:08] but this behavior is new, as in didn't happen when we originally tested on the labstore so that's odd [15:26:53] (03PS20) 10KartikMistry: cxserver configuration for beta labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 (owner: 10Nikerabbit) [15:27:20] chasemp: the ldap users on that system are important in order to manage permissions on nfs. But of course none of those users will actually ever log in... [15:29:58] (03PS1) 10Hashar: zuul: python-statsd has been packaged [operations/puppet] - 10https://gerrit.wikimedia.org/r/141442 [15:31:13] who wants to review some changes ? [15:31:32] If it is packaging changes :) [15:32:04] andrewbogott: I'm not sure what the sane thing to do here is :) I think for now not having competing user management has to be the case? i.e. remove include admin from ldap...? [15:32:29] maybe in the long term we put in some logic that does the right thing if ldap is already present but this is a complicated use case [15:32:40] chasemp: I'm not sure what you mean by 'remove include admin from ldap' [15:32:50] But, leaving that box with a root-only login seems ok to me [15:32:58] (if that's what you meant...) [15:33:12] yes that for practical purposes atm [15:33:36] and make a ticket for more gracefully handling this case [15:34:26] essentially, the puppet logic we piggy back on for ensuring users can't deal with the users defined in ldap and not in standard /etc/passwd [15:34:33] so it continually tries to mod users it can't find, etc [15:38:16] kart_: lint changes :) [15:42:57] <_joe_> chasemp: dpkg -l labstore10001 [15:43:01] <_joe_> ehrm [15:43:07] <_joe_> dpkg -l puppet [15:43:24] (03PS1) 10Filippo Giunchedi: swift: source is a bashism! replace with . 
[operations/puppet] - https://gerrit.wikimedia.org/r/141446 [15:43:43] <_joe_> I HEART bashisms [15:43:47] <_joe_> :) [15:43:48] dash FTW [15:43:57] _joe_: is it your thinking the puppet version changing changed the behavior? [15:44:02] (PS2) Filippo Giunchedi: swift: source is a bashism! replace with . [operations/puppet] - https://gerrit.wikimedia.org/r/141446 [15:44:09] (CR) Filippo Giunchedi: [C: 2 V: 2] swift: source is a bashism! replace with . [operations/puppet] - https://gerrit.wikimedia.org/r/141446 (owner: Filippo Giunchedi) [15:44:14] <_joe_> chasemp: it's the only thing that changed [15:44:48] true :) I thought we were going pre 3.4? but either way seems like the best idea [15:48:40] (CR) BryanDavis: Add roles for testing swift in labs (2 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/137803 (owner: Andrew Bogott) [15:53:38] _joe_: I have another question about apt and reprepro… I just dropped in a new package to precise/universe: [15:53:39] # reprepro ls tcl-trf [15:53:40] tcl-trf | 2.1.4-dfsg2-1 | precise-wikimedia | amd64 [15:53:49] But when I apt-get tcl-trf I still get the upstream version rather than that one. [15:54:06] which is 2.1.4-dfsg-2build2 [15:54:07] <_joe_> mhhh [15:54:33] <_joe_> andrewbogott: which server? [15:54:42] I'm installing on a labs box, tools-tcl-test [15:54:54] A one-off box, you can do whatever you want with it. [15:55:03] <_joe_> I have no idea how labs is set up regarding debs [15:55:15] hm [15:55:21] <_joe_> I can take a look later [15:55:22] well, this is in tools, maybe that's weird. [15:55:30] thanks, I will look at that as well. [15:55:48] The fact that you didn't expect this behavior suggests that I'm not totally confused :) [15:56:28] Actually, the apt config looks like a normal prod config (as it should be) [15:58:30] ottomata: hey [15:58:34] _joe_: I forgot to apt-get update!
[15:59:07] paravoid: heya [15:59:23] ottomata: there's tons of an10xx alerts on icinga [15:59:36] first off, a disabled puppet for two weeks is not okay [15:59:47] I see at least three of them, the oldest one being June 9th [15:59:57] then there's a kafka broker alert on an1018 [16:00:14] oh and for bonus points, elastic1017 alerts :) [16:00:17] oh boy [16:00:21] :) [16:00:30] i just got back from a much much delayed overnight drive from toronto [16:00:34] oh hah [16:00:35] sorry :) [16:00:35] we were supposed to get back at like 3am last night [16:00:43] didn't leave til midnight (14 people in a big van) [16:00:47] <^d> Poor elastic1017 [16:00:51] i drove for like 7 hours and then slept for 2! [16:00:56] <^d> I thought that was still not setup fully. [16:00:58] <^d> 17-19 [16:01:00] also, analytics quarterly review! woo! [16:01:11] yes, elastic1007 (afaik) is waiting reinstallation of ssds [16:01:16] paravoid, quick q [16:01:21] re puppet disabled on kafka brokers [16:01:36] if you were going to do a week or two of tuning experiments [16:01:47] where you had to change a config value but then wait for a while to see the effect [16:01:52] PROBLEM - Puppet freshness on amssq44 is CRITICAL: Last successful Puppet run was Mon 23 Jun 2014 13:01:40 UTC [16:01:52] PROBLEM - Puppet freshness on amssq43 is CRITICAL: Last successful Puppet run was Mon 23 Jun 2014 13:01:30 UTC [16:02:02] what would you do? make commits for every change? even though they aren't changes that you might want to keep? [16:02:05] (03PS1) 10Rush: remove admin yaml from labstore* until RT #7732 [operations/puppet] - 10https://gerrit.wikimedia.org/r/141450 [16:02:07] yeah why not [16:02:27] guess cause its annoying? but ok! [16:02:27] :) [16:02:41] andrewbogott: to you good sir https://gerrit.wikimedia.org/r/#/c/141450/ [16:02:41] checking icinga... 
[16:03:08] i'm just going to ack the elastic1017 ones [16:03:16] we know it is down [16:03:18] (03CR) 10Faidon Liambotis: [C: 032] MX switch, part 1 [operations/dns] - 10https://gerrit.wikimedia.org/r/141426 (owner: 10Faidon Liambotis) [16:03:21] paravoid: pong [16:03:52] PROBLEM - Puppet freshness on amssq32 is CRITICAL: Last successful Puppet run was Mon 23 Jun 2014 13:02:46 UTC [16:03:52] (03PS2) 10Rush: add admin users to default node definition [operations/puppet] - 10https://gerrit.wikimedia.org/r/140930 [16:04:07] (03CR) 10Rush: [C: 032 V: 032] add admin users to default node definition [operations/puppet] - 10https://gerrit.wikimedia.org/r/140930 (owner: 10Rush) [16:04:31] ACKNOWLEDGEMENT - Kafka Broker Messages In on analytics1018 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 0.0 ottomata This is a newly puppetized kafka broker, but does not yet have any traffic partitions assigned to it yet. [16:04:52] PROBLEM - Puppet freshness on amssq35 is CRITICAL: Last successful Puppet run was Mon 23 Jun 2014 13:03:42 UTC [16:05:05] chasemp: admin isn't included from base or standard? [16:05:08] bblack: hey, I was grepping for boxes that haven't been updated with recent puppet changes and lots (all?) of amssq*.esams.wmnet came up [16:05:14] bblack: I assume that's normal, right? [16:05:17] andrewbogott: nope [16:05:26] (03PS1) 10Alexandros Kosiaris: Check puppet's last run [operations/puppet] - 10https://gerrit.wikimedia.org/r/141452 [16:05:48] paravoid: no, probably not normal [16:05:54] (03CR) 10Andrew Bogott: [C: 032] "Looks right, although as long as this is in place we absolutely cannot disable root logins :)" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141450 (owner: 10Rush) [16:06:11] (03PS2) 10Rush: remove admin yaml from labstore* until RT #7732 [operations/puppet] - 10https://gerrit.wikimedia.org/r/141450 [16:06:12] chasemp: you want to merge and deploy or shall I? 
[16:06:21] andrewbogott: on it now [16:06:23] thx [16:06:25] ACKNOWLEDGEMENT - Kafka Broker Server on analytics1021 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args kafka.Kafka /etc/kafka/server.properties ottomata analytics1021 has just had one of its hard drives replaced. Kafka needs fixed on this node. Otto is on it. [16:06:36] (03CR) 10Rush: [C: 032 V: 032] remove admin yaml from labstore* until RT #7732 [operations/puppet] - 10https://gerrit.wikimedia.org/r/141450 (owner: 10Rush) [16:06:38] Notice: Skipping run of Puppet configuration client; administratively disabled (Reason: 'Disabled by default on new installations'); [16:06:52] PROBLEM - Puppet freshness on amssq46 is CRITICAL: Last successful Puppet run was Mon 23 Jun 2014 13:06:10 UTC [16:06:52] that was from amssq31 [16:06:54] paravoid: it's somehow related to their install during the puppet3 migration process, my downgrading packages and then we upgraded them later, etc. somehow it ended up disabled [16:07:00] I'll fix it [16:07:52] PROBLEM - Puppet freshness on amssq34 is CRITICAL: Last successful Puppet run was Mon 23 Jun 2014 13:07:06 UTC [16:08:36] reprogramming my fingers from "puppetd -t" to "puppet agent -t" is annoying [16:08:52] PROBLEM - Puppet freshness on amssq36 is CRITICAL: Last successful Puppet run was Mon 23 Jun 2014 13:08:27 UTC [16:08:52] PROBLEM - Puppet freshness on amssq40 is CRITICAL: Last successful Puppet run was Mon 23 Jun 2014 13:08:32 UTC [16:08:59] yeah [16:09:02] paravoid, I will get to all this asap, but possibly not much of it until tomorrow. I'm pretty zombified right now. 
gonna make my meetings today but not much else...will take another day off to cover the my zombieness (since I doubt we get zombie comp days :p ) [16:09:17] ottomata: ok [16:09:29] bblack: I have aliased to 'pat' :) [16:09:52] PROBLEM - Puppet freshness on amssq41 is CRITICAL: Last successful Puppet run was Mon 23 Jun 2014 13:09:22 UTC [16:09:52] PROBLEM - Puppet freshness on amssq42 is CRITICAL: Last successful Puppet run was Mon 23 Jun 2014 13:09:12 UTC [16:09:52] RECOVERY - Puppet freshness on amssq42 is OK: puppet ran at Mon Jun 23 16:09:51 UTC 2014 [16:09:59] <_joe_> mmmh [16:10:04] <_joe_> is that my fault? [16:10:06] sorry, i was out thursday and friday too, so probably missed some of this then. I know kafka is in a weird state right now too, because i've been poking at it so much. that is mostly done with, i just have to have the time to put it back into prod state. now that we ahve analytics1021 back (thanks cmjohnson!) we can do that soon [16:10:09] <_joe_> can be? [16:10:22] RECOVERY - Puppet freshness on amssq41 is OK: puppet ran at Mon Jun 23 16:10:17 UTC 2014 [16:10:33] _joe_: what is? [16:10:51] <_joe_> the amssq* puppet runs [16:11:52] PROBLEM - Puppet freshness on amssq38 is CRITICAL: Last successful Puppet run was Mon 23 Jun 2014 13:11:39 UTC [16:12:05] _joe_: not really, it's just I happened to install them right in the middle of things. they got puppet3 packages and then something didn't work right (at that moment) because of it, and I had to manually downgrade to puppet2 [16:12:12] RECOVERY - Puppet freshness on amssq38 is OK: puppet ran at Mon Jun 23 16:12:08 UTC 2014 [16:12:24] something about that + the upgrade to running puppet3 everywhere a couple days later left them in some default-disabled state with puppet3 installed [16:12:25] <_joe_> bblack: what was not working? 
[16:12:49] <_joe_> sorry, it's something we should understand give we're on puppet 3 completely now [16:13:09] I don't think it's anything to worry about, just an artifact of that moment in time in the upgrade process [16:13:26] <_joe_> mh ok [16:13:49] IIRC it was something like a package version check on the puppet package itself failing, because they got puppet3 during their debian install but the manifests for these machines was still trying to force puppet2, or whatever [16:13:54] (03CR) 10Rush: [C: 031] pmacct - also create group for systemuser [operations/puppet] - 10https://gerrit.wikimedia.org/r/141074 (owner: 10Dzahn) [16:14:01] (03CR) 10Rush: [C: 031] rancid: also create system group for system user [operations/puppet] - 10https://gerrit.wikimedia.org/r/141078 (owner: 10Dzahn) [16:15:09] <_joe_> bblack: so I should re-enable puppet on those servers? [16:15:14] I already did [16:15:25] <_joe_> ok sorry [16:15:28] (and ran one manually, it was fine and picked up the exim changes) [16:16:12] RECOVERY - Puppet freshness on amssq44 is OK: puppet ran at Mon Jun 23 16:16:08 UTC 2014 [16:17:22] RECOVERY - Puppet freshness on amssq34 is OK: puppet ran at Mon Jun 23 16:17:20 UTC 2014 [16:17:22] RECOVERY - Puppet freshness on amssq46 is OK: puppet ran at Mon Jun 23 16:17:20 UTC 2014 [16:17:32] RECOVERY - Puppet freshness on amssq40 is OK: puppet ran at Mon Jun 23 16:17:25 UTC 2014 [16:17:32] RECOVERY - Puppet freshness on amssq35 is OK: puppet ran at Mon Jun 23 16:17:25 UTC 2014 [16:17:32] RECOVERY - Puppet freshness on amssq43 is OK: puppet ran at Mon Jun 23 16:17:25 UTC 2014 [16:18:03] RECOVERY - Puppet freshness on amssq32 is OK: puppet ran at Mon Jun 23 16:18:00 UTC 2014 [16:18:13] RECOVERY - Puppet freshness on amssq36 is OK: puppet ran at Mon Jun 23 16:18:05 UTC 2014 [16:18:32] ^ ran the rest just to clear the alerts [16:22:27] <_joe_> bblack: thanks :) [16:32:59] (03CR) 10Dzahn: [C: 031] "works for me. 
tried on iron:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141452 (owner: 10Alexandros Kosiaris) [16:33:52] !log switched inbound mail for all non-wikimedia.org domains from mchenry/sodium to polonium/mchenry (~16:00 + <= 1h TTL UTC) [16:33:57] Logged the message, Master [16:34:36] (03CR) 10Dzahn: [C: 031] Remove no longer needed base::access::dc-techs [operations/puppet] - 10https://gerrit.wikimedia.org/r/141373 (owner: 10Hoo man) [16:35:54] (03CR) 10Rush: [C: 031] "seems good man, just wanted to say when this does go in...I don't think ignoring disabled puppet hosts is a good idea. I would even say n" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141452 (owner: 10Alexandros Kosiaris) [16:49:16] !log added mw1149-52 back to pybal apache [16:49:20] Logged the message, Master [16:49:23] \o/ [16:49:40] wohoo, still mildly scared to hit :w :)) [16:49:55] <_joe_> godog: eheh it's your first time editing it? [16:50:02] yeah [16:50:03] <_joe_> it is scary, ain't it? [16:50:28] haha a bit yeah [16:50:51] <_joe_> that's why we should not be editing it by hand. [16:51:16] <_joe_> but this is another story [17:04:14] (03CR) 10Yuvipanda: "Anything blocking this?" [operations/puppet] - 10https://gerrit.wikimedia.org/r/140646 (owner: 10Yuvipanda) [17:04:27] !log Fixed dangling symlink for /etc/apache2/sites-enabled/logstash.wikimedia.org on logstash1001 by deleting symlink and forcing puppet run [17:04:32] Logged the message, Master [17:12:53] greg-g, is this swattable; or should I go through some other process? https://gerrit.wikimedia.org/r/#/c/137804/ [17:14:29] mwalker: looks swattable to me in all honesty [17:15:06] ya; we've attempted to notify the community, amir and niklas have signed off [17:15:29] but, greg-g might want to put it in next weeks deploy notes [17:17:06] mwalker: Greg is only sort-of online today. He's not feeling well. 
[17:17:41] in that case; /me adds it to today's swat [17:18:14] Opsen, I believe this is ready to go, if anyone can spare the CR? https://gerrit.wikimedia.org/r/#/c/137804/ [17:18:45] awight, read the backscroll :p [17:19:41] mwalker: hah, thanks for batting [17:28:37] (03PS1) 10RobH: adding stream.wikimedia.org certificate [operations/puppet] - 10https://gerrit.wikimedia.org/r/141471 [17:31:39] is it safe to restart the salt master on palladium? I think it has stopped cleaning up after itself [17:32:18] (03CR) 10RobH: [C: 032] adding stream.wikimedia.org certificate [operations/puppet] - 10https://gerrit.wikimedia.org/r/141471 (owner: 10RobH) [17:32:54] woot! ^ [17:35:13] godog: yes [17:36:03] akosiaris: thanks! [17:36:53] !log restarted salt-master on palladium, suspected job cleanup stuck [17:36:57] Logged the message, Master [17:37:16] Nemo_bis: are you affiliated with the Archive Team? [17:38:04] !log lvs1005:eth3 was negotiated to 100mbps (???) - disable -> enable on switch fixed it [17:38:09] Logged the message, Master [17:38:19] bblack: I had an RT ticket for that [17:38:40] bblack: https://rt.wikimedia.org/Ticket/Display.html?id=7731 [17:38:56] (03PS1) 10ArielGlenn: data retention audit script for logs, /root and /home dirs [operations/software] - 10https://gerrit.wikimedia.org/r/141473 [17:38:59] (03CR) 10jenkins-bot: [V: 04-1] data retention audit script for logs, /root and /home dirs [operations/software] - 10https://gerrit.wikimedia.org/r/141473 (owner: 10ArielGlenn) [17:39:03] heh [17:39:15] figures. using it already too. 
[17:39:22] it's probably still a bad cable, I'll update the ticket with various info though [17:39:24] * apergos goes to abandon the other changeset [17:39:37] :) [17:40:24] (Abandoned) ArielGlenn: log auditing tool plus wrapper script to audit each cluster [operations/software] - https://gerrit.wikimedia.org/r/136741 (owner: ArielGlenn) [17:41:09] paravoid: there is no such thing as affiliation to AT [17:41:11] and indeed the salt master is trying to pick up its job cache again [17:41:20] ok [17:41:21] can you ask your question more directly? :) [17:41:37] I linked them your patch and ganglia graphs in their channel [17:41:46] Nemo_bis: their bot is not obeying robots.txt and hence is banned from at least one server of ours [17:42:11] and that solved it right? or is something new happening? anyway I have no control over it [17:43:22] it solved the issue for us, yes [17:43:27] good [17:43:30] Nemo_bis, why are we being archived at all? [17:43:55] I guess the person who triggered the bot likes wikimedia [17:44:03] I just heard you had something to do with them or something, so if you were, I thought it'd be nice to let you know :) [17:44:19] I forwarded them the link like 2 min after it was merged [17:44:22] PROBLEM - DPKG on aluminium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [17:44:25] ok [17:45:27] noone has still replied on-list for the WLM domains btw [17:45:39] but I'm the bad guy for requesting such things to be properly coordinated with us [17:46:34] probably jeremyb will figure out their needs and file RT tickets [17:48:09] FYI the ArchiveTeam bot is just an IRC-commanded bot which swallows smallish websites [17:49:25] @James_F apologies, another friendly reminder to deploy visual editor onto Wikimania 2014 wiki [17:52:37] (CR) Yuvipanda: "(Manager approval received on RT)" [operations/puppet] - https://gerrit.wikimedia.org/r/140646 (owner: Yuvipanda) [17:53:21] mutante: springle can I convince either of you to +2 ^?
:) has been 3 days as well I reckon [17:54:07] (03PS1) 10BBlack: enable ganglia aggregator on lvs300[12] [operations/puppet] - 10https://gerrit.wikimedia.org/r/141477 [18:00:25] edsaperia: It's scheduled for 16:00 SF-time today. [18:00:34] \o/ [18:00:40] I love 16:00 [18:01:33] ᕕ( ᐛ )ᕗ [18:04:02] (03CR) 10BBlack: [C: 032] enable ganglia aggregator on lvs300[12] [operations/puppet] - 10https://gerrit.wikimedia.org/r/141477 (owner: 10BBlack) [18:18:06] (03CR) 10MarkTraceur: [C: 031] Remove whitelist entry for now-graduated MediaViewer BetaFeature [operations/mediawiki-config] - 10https://gerrit.wikimedia.org/r/141054 (owner: 10Jforrester) [18:39:06] (03CR) 10CSteipp: [C: 031] "As I put on the rt ticket, I think this feature is a good idea. I'll leave it to others to make sure the vcl syntax is correct and tested." [operations/puppet] - 10https://gerrit.wikimedia.org/r/141086 (owner: 10Ori.livneh) [18:44:05] (03PS1) 10Hashar: zuul: mv python-voluptous in the array of packages [operations/puppet] - 10https://gerrit.wikimedia.org/r/141486 [18:44:10] (03PS1) 10Hashar: zuul: migrate merger definitions to merger.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/141487 [18:44:10] (03PS1) 10Hashar: zuul: migrate server definitions to server.pp [operations/puppet] - 10https://gerrit.wikimedia.org/r/141488 [18:51:44] mark: thoughts on https://gerrit.wikimedia.org/r/#/c/141086/ ? [18:51:52] if you have a moment [18:53:35] (03CR) 10Mark Bergsma: [C: 031] [HAT] text-frontend VCL: set Content-Type if not set [operations/puppet] - 10https://gerrit.wikimedia.org/r/141086 (owner: 10Ori.livneh) [18:53:39] danke [18:53:43] i appreciate it [18:54:22] what is HAT? [18:55:41] HHVM, Apache 2.4 and Trusty [18:55:45] i'm trying to make that a thing [18:55:45] lol [18:55:53] a nod to LAMP [18:55:55] almost fell for it :-P [18:55:56] haha [18:56:04] (03CR) 10Hashar: [C: 031 V: 032] "Tested on a labs instance having Zuul. All fine!" 
[operations/puppet] - https://gerrit.wikimedia.org/r/141442 (owner: Hashar) [18:56:11] (PS1) Faidon Liambotis: authdns: move standard include from role to site.pp [operations/puppet] - https://gerrit.wikimedia.org/r/141490 [18:56:13] (CR) Hashar: [C: 1 V: 2] "Tested on a labs instance having Zuul. All fine!" [operations/puppet] - https://gerrit.wikimedia.org/r/141486 (owner: Hashar) [18:56:14] apergos: ^ [18:56:23] :) [18:56:26] ahhh [18:56:28] (CR) Hashar: [C: 1 V: 2] "Tested on a labs instance having Zuul. All fine!" [operations/puppet] - https://gerrit.wikimedia.org/r/141487 (owner: Hashar) [18:56:39] (CR) Hashar: [C: 1 V: 2] "Tested on a labs instance having Zuul. All fine!" [operations/puppet] - https://gerrit.wikimedia.org/r/141488 (owner: Hashar) [18:56:45] making a note for tomorrow (I'm pretty much done for the day today) [18:56:53] (CR) Faidon Liambotis: [C: 2] authdns: move standard include from role to site.pp [operations/puppet] - https://gerrit.wikimedia.org/r/141490 (owner: Faidon Liambotis) [18:57:11] sure, I just did it now so I won't forget :) [18:58:09] yep, I have that open in a tab for tomorrow so *I* won't forget :-) [18:58:22] PROBLEM - swift-object-server on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [18:58:25] ori: hello. Apparently you updated puppet stdlib module. It comes with a definition ensure_packages() that would let us get rid of all if !
defined Package['used-everywhere'] stanza :-] [18:58:52] hashar: i thought about it long and hard and i think we should do it for the mediawiki module [18:58:52] PROBLEM - swift-container-updater on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [18:58:56] but you can expect that to be a hard sell [18:59:02] PROBLEM - swift-account-server on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [18:59:02] PROBLEM - swift-account-auditor on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [18:59:03] because _that's_ what I've been missing from our puppet tree [18:59:07] moar magic [18:59:08] *g* [18:59:26] because while most folks resent puppet balking at duplicate package definitions, they appreciate the discipline it forces on you to separate out your application stack definitions [18:59:32] PROBLEM - swift-container-server on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [18:59:32] PROBLEM - swift-object-replicator on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [18:59:32] PROBLEM - swift-object-auditor on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [18:59:42] PROBLEM - swift-container-replicator on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [18:59:50] but for grab-bag things like mediawiki's packages.pp, which defines a whole pile of packages that app servers depend on, i think it makes sense [18:59:51] jgage: you should probably !log that all that is you :) [19:00:12] PROBLEM - swift-container-auditor on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args 
^/usr/bin/python /usr/bin/swift-container-auditor [19:00:12] PROBLEM - swift-account-replicator on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [19:00:12] PROBLEM - swift-account-reaper on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [19:00:22] because it's a common use-case for various services that work with mediawiki to require the same set of packages (like contint, etc.) [19:00:24] ori: I need the mediawiki packages on the contint jenkins slaves but some package conflict with other manifests I have :-( [19:00:28] right [19:00:31] ori: https://gerrit.wikimedia.org/r/#/c/138804/ [19:00:32] PROBLEM - swift-object-updater on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [19:00:42] ori: that Gerrit change got merged once but caused puppet failures all over the place iirc [19:00:51] so +1 for ensure_packages in mediawiki/maniafests/packages.pp [19:00:57] but see what other folks think [19:01:06] manifests, not maniafests ;) [19:01:38] ori: glad you like the idea :-] [19:01:54] hashar: want me to submit a patch? [19:01:59] ori: na it is broken [19:02:09] for ensure_packages in packages.pp i mean [19:02:35] see if we can convince paravoid it's a good idea :P [19:02:37] the conflict I had was with apache2-mpm-prefork and libapache2-mod-php5 packages [19:02:46] both being in mediawiki::packages and web server::php5 [19:03:07] my change https://gerrit.wikimedia.org/r/#/c/138804/ is/was a lame attempt to factor out both packages definition to some new wrapping classes [19:03:16] but ensure_packages() might render that useless (and simpler) [19:04:26] (03CR) 10Hashar: "Ori updated the stdlib puppet modules that comes with a ensure_packages(). 
That would solve the duplicate definitions we encountered earli" [operations/puppet] - 10https://gerrit.wikimedia.org/r/138804 (owner: 10Hashar) [19:04:32] RECOVERY - Disk space on ms-be3003 is OK: DISK OK [19:07:02] RECOVERY - swift-account-server on ms-be3003 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [19:07:03] RECOVERY - swift-account-auditor on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [19:07:12] RECOVERY - swift-account-replicator on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [19:07:13] RECOVERY - swift-container-auditor on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:07:13] RECOVERY - swift-account-reaper on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [19:07:22] RECOVERY - swift-container-server on ms-be3003 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [19:07:32] RECOVERY - swift-object-auditor on ms-be3003 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [19:07:32] RECOVERY - swift-object-updater on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [19:07:42] RECOVERY - swift-object-replicator on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [19:07:42] RECOVERY - swift-object-server on ms-be3003 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [19:07:42] RECOVERY - swift-container-replicator on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [19:07:52] RECOVERY - swift-container-updater on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python 
/usr/bin/swift-container-updater [19:16:12] PROBLEM - swift-account-reaper on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [19:16:22] PROBLEM - swift-container-server on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [19:16:32] PROBLEM - swift-object-updater on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-updater [19:16:32] PROBLEM - swift-object-auditor on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [19:16:42] PROBLEM - swift-object-replicator on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [19:16:42] PROBLEM - swift-object-server on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [19:16:42] PROBLEM - swift-container-replicator on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [19:16:52] PROBLEM - swift-container-updater on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-updater [19:17:02] PROBLEM - swift-account-server on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [19:17:02] PROBLEM - swift-account-auditor on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [19:17:12] PROBLEM - swift-account-replicator on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [19:17:12] PROBLEM - swift-container-auditor on ms-be3003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:17:41] oops 
sorry for the noise, that's me [19:18:42] RECOVERY - swift-object-server on ms-be3003 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [19:18:42] RECOVERY - swift-container-replicator on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [19:18:52] RECOVERY - swift-container-updater on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [19:19:02] RECOVERY - swift-account-server on ms-be3003 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [19:19:02] RECOVERY - swift-account-auditor on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [19:19:12] RECOVERY - swift-account-replicator on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [19:19:12] RECOVERY - swift-container-auditor on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:19:12] RECOVERY - swift-account-reaper on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [19:19:22] RECOVERY - swift-container-server on ms-be3003 is OK: PROCS OK: 25 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [19:19:33] RECOVERY - swift-object-updater on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [19:19:33] RECOVERY - swift-object-auditor on ms-be3003 is OK: PROCS OK: 2 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [19:19:51] RECOVERY - swift-object-replicator on ms-be3003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [19:20:30] !log ms-be3003 full root partition fixed, swift had written to /srv/swift-storage/sdk1 onto root due to umounted sdk1 [19:20:36] Logged the message, 
Master [19:22:26] !log gallium / zuul : deleting /var/lib/zuul/git old Zuul repositories. They have been migrated to /srv/ssd/zuul/git/ ages ago [19:22:31] Logged the message, Master [19:31:51] (03CR) 1001tonythomas: "* now, after removing errors_to: from the exim4.conf, and adding the '-f verp-return_path@wikimedia.org' to the $wgAdditionalMailParams in" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141287 (owner: 1001tonythomas) [19:38:28] (03CR) 10Faidon Liambotis: [C: 04-1] "Yes, errors_to should be removed entirely. It should *not* be set to the value of the Return-Path header; that header should not be set by" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141287 (owner: 1001tonythomas) [19:39:33] tonythomas: ^ [19:39:46] paravoid: hi [19:40:03] hello [19:40:47] paravoid: oh. so we can remove errors_to ? [19:40:56] yeah [19:42:08] but, as you would've seen in the raw mail :- the gmail authentication problems gave 'neutral' rather than pass. and in the inbox, I receive an additional warning from gmail :- This message may not have been sent by: 01tonythomas@gmail.com Learn more Report phishing [19:42:29] might be because I'm inside the labs net ? [19:43:02] that's because you're trying to send as @gmail.com from Labs [19:43:20] in production, we'd still send from NNN-@wikimedia.org [19:43:35] and we'd be authoritative for that domain [19:44:16] ok. then no probs. will submit a patchset getting the removal from mail.ini too [19:44:26] is mediawiki passing -f now? [19:45:39] I think so -- but I hadn't tried this case :- no errors_to and no additionalparams. Should check that one too [19:45:59] which version of mediawiki has that? [19:47:24] actually, I found it here https://github.com/wikimedia/mediawiki-core/blob/master/includes/UserMailer.php#L374 [19:48:03] and where is that defined as -f ... ? [19:48:42] nah. 
I just gave it as a test [19:49:11] that's the flag you'd throw with postfix's sendmail binary [19:49:12] mediawiki should pass -f wiki@wikimedia.org *before* we merge a change in puppet that removes that [19:49:53] ah exim too [19:49:53] (03PS1) 10Hashar: zuul: prefix server default template with 'zuul.' [operations/puppet] - 10https://gerrit.wikimedia.org/r/141501 [19:49:55] (03PS1) 10Hashar: zuul: merger now has its own default file [operations/puppet] - 10https://gerrit.wikimedia.org/r/141502 [19:50:13] Jeff_Green: we have a php.ini override for -f, that doesn't allow mediawiki to set an arbitrary envelope from [19:50:19] Jeff_Green: that sets it to <> [19:50:26] ahh [19:50:27] Jeff_Green: then, we have an exim router option that forces it to wiki@wikimedia.org [19:50:33] (errors_to) [19:50:37] yeah I saw the exim router [19:50:44] didn't know about the php one too [19:50:44] we can remove both, as long as mediawiki is prepared to send a correct -f [19:50:48] I don't think that's the case now [19:50:55] seems like someone REALLY didn't want this beast to handle bounce mail :-) [19:51:04] :) [19:51:32] looks like mw parsed -f correctly -- I just removed the -f option and the mail came from root@wikimedia.org [19:52:00] (03CR) 10Hashar: [C: 031 V: 032] "tested on labs instance. It is a noop." [operations/puppet] - 10https://gerrit.wikimedia.org/r/141501 (owner: 10Hashar) [19:52:54] (03PS2) 10Hashar: zuul: merger now has its own default file [operations/puppet] - 10https://gerrit.wikimedia.org/r/141502 [19:55:15] (03CR) 10Hashar: [C: 031 V: 032] "PS1 had a lame typo. 
Tested PS2 on a labs instance, the zuul-merger is launched with:" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141502 (owner: 10Hashar) [20:00:05] gwicke, subbu: The time is nigh to deploy Parsoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140623T2000) [20:00:59] RECOVERY - DPKG on aluminium is OK: All packages OK [20:05:56] !log deployed parsoid 392435a2 (deploy sha db94f88c) [20:06:01] Logged the message, Master [20:11:38] (03PS1) 10Ori.livneh: [HAT] text-frontend VCL: set Content-Type if not set [operations/puppet/varnish] - 10https://gerrit.wikimedia.org/r/141556 [20:11:50] ^ bblack [20:11:55] different patch because different repo [20:13:00] I'll need to follow-up with a commit to update the submodule [20:15:04] (03CR) 10Ori.livneh: [C: 04-2] "Doing this in wikimedia.vcl in varnish submodule instead; see https://gerrit.wikimedia.org/r/#/c/141556/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141086 (owner: 10Ori.livneh) [20:31:17] (03PS21) 10Nikerabbit: cxserver configuration for beta labs [operations/puppet] - 10https://gerrit.wikimedia.org/r/139095 [20:38:24] (03CR) 10BBlack: [C: 032] [HAT] text-frontend VCL: set Content-Type if not set [operations/puppet/varnish] - 10https://gerrit.wikimedia.org/r/141556 (owner: 10Ori.livneh) [20:38:35] weee [20:39:00] you ahve subodule update already or want me to? 
[20:39:09] wow my typing is horrible, maybe you should :) [20:39:20] heh already on it [20:39:32] (03PS5) 1001tonythomas: Removed exim errors_to to support custom Return-Path [operations/puppet] - 10https://gerrit.wikimedia.org/r/141287 [20:39:59] (03PS1) 10Ori.livneh: Update varnish submodule for 9b9484ba87 [operations/puppet] - 10https://gerrit.wikimedia.org/r/141565 [20:40:28] bblack: ^ [20:41:24] (03CR) 10BBlack: [C: 032 V: 032] Update varnish submodule for 9b9484ba87 [operations/puppet] - 10https://gerrit.wikimedia.org/r/141565 (owner: 10Ori.livneh) [20:41:32] (03PS1) 10Yuvipanda: tools: Install openbabel [operations/puppet] - 10https://gerrit.wikimedia.org/r/141566 (https://bugzilla.wikimedia.org/66995) [20:41:59] (03PS2) 10Yuvipanda: tools: Install openbabel [operations/puppet] - 10https://gerrit.wikimedia.org/r/141566 (https://bugzilla.wikimedia.org/66995) [20:42:18] (03Restored) 10Ori.livneh: mw/apache 2.4 compat: remove DefaultType directive [operations/puppet] - 10https://gerrit.wikimedia.org/r/138891 (owner: 10Ori.livneh) [20:43:08] (03CR) 10Faidon Liambotis: [C: 04-1] "You need to ensure => absent the file for the change to actually take effect." [operations/puppet] - 10https://gerrit.wikimedia.org/r/141287 (owner: 1001tonythomas) [20:44:00] ori: you get to redo both patches! :) [20:44:13] VCC-compiler:#012Symbol not found: 'LOG_USER' (expected type INT): [20:44:42] fuck me [20:44:53] probably a C-block for #include [20:45:17] handy that the varnish docs document that usage without mentioning it :P [20:45:40] should we include it just for the defines, or hard code '9'? [20:45:57] #define LOG_ALERT 1 [20:46:05] #define LOG_USER (1<<3) [20:46:27] hardcode is probably easier, since this is temporary either way [20:46:54] (1<<3)|1 = 9 [20:47:01] that's right, yeah? 
i'm bad at math :P [20:47:06] yes, that's right :) [20:49:07] (03PS1) 10Ori.livneh: Hard-code syslog facility/priority [operations/puppet/varnish] - 10https://gerrit.wikimedia.org/r/141571 [20:50:09] ^ bblack [20:50:10] (03PS1) 10Hashar: zuul: split conf file for server and merger [operations/puppet] - 10https://gerrit.wikimedia.org/r/141572 [20:50:41] (03PS6) 1001tonythomas: Removed exim errors_to to support custom Return-Path [operations/puppet] - 10https://gerrit.wikimedia.org/r/141287 [20:51:15] (03CR) 10BBlack: [C: 032] Hard-code syslog facility/priority [operations/puppet/varnish] - 10https://gerrit.wikimedia.org/r/141571 (owner: 10Ori.livneh) [20:51:57] (03PS1) 10Ori.livneh: Update varnish submodule for 87471cea21 [operations/puppet] - 10https://gerrit.wikimedia.org/r/141573 [20:52:57] (03CR) 10BBlack: [C: 032 V: 032] Update varnish submodule for 87471cea21 [operations/puppet] - 10https://gerrit.wikimedia.org/r/141573 (owner: 10Ori.livneh) [20:54:45] (03PS4) 10Ori.livneh: [HAT] Remove DefaultType directive from Apache config [operations/puppet] - 10https://gerrit.wikimedia.org/r/138891 [20:54:48] (03CR) 10Hashar: [C: 031 V: 032] "Confirmed to works on a labs instance \O/" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141572 (owner: 10Hashar) [20:55:13] ori: lol, we default-type like, everything [20:55:31] bblack: wait, really? [20:55:40] well, not everything, but a shitload of stuff [20:56:13] small random snippet from cp1052 where I puppeted manually: http://paste.debian.net/106439/ [20:56:33] the /check is the pybal health check [20:56:48] what about things like: /wiki/Ed_Hughes_(anchor) [20:57:31] (which, when I hit it with my browser, has Content-Type text/html) [20:57:45] for me too [20:58:46] so? 
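Note on the syslog value hard-coded above: the arithmetic (1<<3)|1 = 9 can be checked with a few lines of Python. This is only a sketch. The two constants are copied from the #define lines quoted in the chat, and the helper name syslog_priority is made up for illustration; it is not part of Varnish or of the patch.

```python
# Values quoted in the chat from the syslog header:
#   #define LOG_ALERT 1
#   #define LOG_USER  (1<<3)
LOG_ALERT = 1
LOG_USER = 1 << 3


def syslog_priority(facility: int, severity: int) -> int:
    """Combine an already-shifted facility with a severity (hypothetical helper)."""
    return facility | severity


print(syslog_priority(LOG_USER, LOG_ALERT))  # prints 9
```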
[20:59:03] in any case, let's revert [20:59:10] it's so spammy it's making rsyslog drop messages [20:59:45] looks a lot like "everything", so maybe the patch is just wrong from a VCL perspective [20:59:52] something about when you can actually change/read what [21:02:14] PROBLEM - DPKG on osmium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [21:02:20] that's me [21:02:33] bblack: it must be wrong [21:02:36] hrm [21:03:28] bblack: should it be beresp? [21:03:54] (03PS1) 10BBlack: Revert commits 9b9484ba + 87471cea [operations/puppet/varnish] - 10https://gerrit.wikimedia.org/r/141574 [21:04:11] (03CR) 10BBlack: [C: 032 V: 032] Revert commits 9b9484ba + 87471cea [operations/puppet/varnish] - 10https://gerrit.wikimedia.org/r/141574 (owner: 10BBlack) [21:05:26] bblack: per the chart you found, beresp is not available in vcl_deliver: https://www.varnish-software.com/static/book/VCL_functions.html [21:05:27] hmm [21:07:32] (03CR) 10Gage: [C: 031] "Changes vs PS5 LGTM!" [operations/puppet/cdh4] (cdh5) - 10https://gerrit.wikimedia.org/r/135494 (owner: 10Ottomata) [21:07:51] i don't get it [21:07:58] "VCL - vcl_deliver: Often used to add and remove debug-headers " [21:08:03] "If you need to remove a header, or add one that isn’t supposed to be stored in the cache, vcl_deliver is the place to do it." [21:08:11] mutante, back? 
[21:08:30] I'm wondering if you made a final determination about how long the puppet freshness check is [21:08:34] (03PS1) 10BBlack: update varnish module to 167e29e1 [operations/puppet] - 10https://gerrit.wikimedia.org/r/141575 [21:08:46] (03CR) 10BBlack: [C: 032 V: 032] update varnish module to 167e29e1 [operations/puppet] - 10https://gerrit.wikimedia.org/r/141575 (owner: 10BBlack) [21:09:21] (yes, 4 minutes is how long I spent re-figuring-out how to update the submodule) [21:09:49] (03PS2) 10Andrew Bogott: Intentionally break puppet compile for virt1008 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140745 [21:10:06] bblack: sorry, i should have done it [21:10:29] arguably, on top of the idea that our varnish module isn't ready to be a submodule, git's design for how submodules work is horrible [21:11:15] anyways, I'm salting puppet on the caches to kill it quicker [21:12:02] do you have any idea why it didn't work? [21:12:09] not really, not yet [21:12:27] could try to figure it out in beta [21:12:43] yeah, i had the previous patch on beta [21:12:50] i'll try [21:14:01] also, while watching puppet stream by on all the caches, I see we've got lingering VCL compilation failures going on in various places due to various older changes [21:14:13] we should really find a way to tie that into an icinga check or something [21:15:07] (03CR) 10Andrew Bogott: [C: 032] Intentionally break puppet compile for virt1008 [operations/puppet] - 10https://gerrit.wikimedia.org/r/140745 (owner: 10Andrew Bogott) [21:16:28] this one is fabulous: http://paste.debian.net/106440/ [21:17:08] who ever heard of a language where you can't do #include or import or whatever your statement-of-choice is twice? [21:19:22] and stackoverflow is dead too. that's like a solid 15% of my brain gone missing. [21:20:56] we have at least 3 separate ongoing issues with failed VCL reloads. 
That one with the duplicate import, a failure on bits caches because there's no such backend test_wikipedia, and the usual GeoIP shlib stuff [21:22:49] i think we saw a lot of DefaultTypes because of redirects [21:23:23] ah [21:23:33] springle: are you awake? :) [21:23:55] that makes some sense. perhaps the if() should also only be doing things on content responses [21:24:09] (but I guess that includes HEAD on them as well, so it can't be "is there a body?" [21:24:23] maybe only on 2xx response code? [21:25:47] what really sucks about the failed VCL reloads is they don't get tried again until the next VCL change. So it's not like we even have a persistent failing puppet run to let us know something's amiss. [21:26:49] well, HTTP 204s don't have a body [21:26:49] there's an impedance mismatch there at the puppet level. In theory you'd want puppet to keep redoing that command on every puppet run until it succeeds, even though the triggering file remains constant (at the new, triggering contents). [21:28:20] varnishd lets you load vcl by string [21:28:49] in which case you can give it an arbitrary name [21:29:08] we could have a puppet resource that reads the vcl file and loads it into varnish [21:29:10] I guess you could force a more-correct behavior by replacing every such construct in your puppet manifests with a recipe that fixes it. You have file contents changing instead trigger an exec that just creates a file like "/etc/varnish-needs-reload", and then have the real exec trigger on the presence of that file, and remove the file on successful completion [21:29:13] it also lets you do two-stage deploys [21:29:18] but, ick, puppet should handle that itself :P [21:29:31] (03CR) 10Tim Landscheidt: [C: 031] tools: Install openbabel [operations/puppet] - 10https://gerrit.wikimedia.org/r/141566 (https://bugzilla.wikimedia.org/66995) (owner: 10Yuvipanda) [21:31:02] ori: the problem runs deeper than just varnish really. 
it could happen for any kind of "execute this reload command when this file changes" situation. because puppet doesn't care, in the long run, if the reload command failed. That causes one failed puppet run, but then the state information that "hey a file changed and this command needs to run (successfully)" is gone. [21:31:14] bblack: you could use the 'unless' contruct in an exec to continually run a command until vcl reloads ok? [21:31:18] bblack: if ((resp.status == 200 || resp.status == 201 || resp.status == 202 || resp.status == 203 || resp.status == 206) && !resp.http.Content-Type) [21:31:22] that's where i ended up [21:31:25] how does that look? [21:31:45] unless && onlyif in an exec cover cases such as this usually [21:31:47] it excludes 204 (No Content) and 205 (Reset Content), which don't contain a response body [21:32:12] chasemp: how? [21:32:50] well, I should say: how succinctly and sanely? [21:33:10] there are ways to do it that look ugly, like managing your own statefiles for successful config reloads as mentioned above. [21:33:18] is there a command you can run that checks for failed VCL reloads? I don't know varnish well [21:33:41] (and a command like that for every other similar case? what happens if rsyslogd fails to reload rsyslogd.conf?) [21:34:23] if I have a case where say I want to run a command every time puppet runs except in thase case of already good state [21:34:35] but that's not what we want [21:35:04] ok then don't mind me, I don't understand the case well enough [21:35:33] we want to run a command when puppet changes the contents of a file (which we already have), but if the command fails we don't want puppet to ignore that going forward [21:35:58] (in the next puppet run) [21:37:36] it just seems like a big oversight in puppet's design to me. unless you go with the idea that a single puppet run failure should be enough of an indicator of a problem. [21:37:37] does the command in question provide sane exit codes...i.e. 
[21:37:42] yes [21:37:42] 0 on sucess, 1 on failure [21:40:49] is it accurate to say it's not the state of the file but the state of VLC reloads your interested in, only the state of the file is the indicator for change. I don't know, I would think an exec with an unless and the command in the command field would suffice. If the command that checks VCL status fails it will run the command to correct it no matter if this is the run we updated the file or not [21:41:05] but I think I don't have a good picture of the issue, and other solutions are as hacky as you suggest [21:41:21] using 'creates' and then removing that when teh file is updated [21:41:36] so it will continually try until success to run an exec until that is satisfied [21:42:04] yeah, wrapping the exec and managing a statefile is certainly possible, it just seems messy to do that for every such case. [21:42:32] (and it's not always exec, it can be a service reload too, so you'd need to customize the service with a different reload command that did the wrapper to remove the statefile on success) [21:43:04] sure but the service shouldn't be running if it can't restart and if it uses reload it should sanely throw a 1 exit code and fail every subsequent puppet run [21:43:28] it's turtles all the way down [21:43:28] the case here is that it uses "reload" to process the change in this file [21:43:47] ah I see now [21:43:48] and it does throw a 1 exit code, which makes puppet barf in purple during the one run when puppet changed that file [21:44:00] but the next run puppet is happy, and the changes remain un-reloaded [21:44:32] yeah the only solution I know of is to put an exec in the change between the file and the service [21:44:44] change = chain [21:45:00] when you consider the general case for all daemon config reloads, and that the reasons for the reload failure could be transient and totally retryable, or fixed by something else unrelated later [21:45:18] (and that it's nice to have a persistent error 
instead of a persistent lack of knowledge that something wasn't reloaded) [21:45:22] it does seem like a gaping whole for puppet [21:45:28] hole even [21:45:42] I'd rather puppet maintained the state that "hey this command has never run succesfully since it was triggered, let's keep trying on future runs" [21:45:59] yeah that's more or less why they added teh 'creates' option to exec [21:46:13] which is maybe bad because they have not come up with something better [21:46:15] (03PS1) 10Ori.livneh: Reattempt Ia36d0d89f with stricter checks [operations/puppet/varnish] - 10https://gerrit.wikimedia.org/r/141584 [21:46:47] usually I have done command => 'does_this_work && touch /my/creation' kind of business [21:46:53] bblack: this does the right thing in labs, but the volume of request on labs is still so low that there could be an issue in prod. before we go through the submodule update, could we try it on a single random varnish? [21:46:57] but I understand now, and your right it's weird [21:46:59] we could probably subclass "service" with our own definition that handles that case for service-reload generically [21:47:00] this == the patch above [21:48:34] brr, sorry [21:48:58] chasemp: actually in this very particular case, it so happens this is an exec of "/usr/share/varnish/reload-vcl" rather than service-reload, but really it's the same problem either way if the command doesn't happen to manage a statefile automatically in a way that's useful for onlyif/unless [21:49:00] (03PS2) 10Ori.livneh: Reattempt Ia36d0d89f with stricter checks [operations/puppet/varnish] - 10https://gerrit.wikimedia.org/r/141584 [21:49:23] understood, was just trying to help [21:49:44] in a truly sane world a service could throw an non-1, non-0 exit code for 'I am running but my config file state is stale [21:49:57] which would handle this case [21:50:00] I believe [21:50:08] yeah [21:50:12] well, for services [21:50:27] now that I think about it, I'm surprised they don't [21:50:43] well, 
they might in some cases, but that would probably be a per-distro / per-init-system thing [21:50:53] even big projects don't follow lsb standards [21:51:02] I had to write a bunch of init wrapper for heartbeat before [21:51:17] because starting a started service and stopping a stopped service wasn't handled 'correctly' all over the place [21:51:17] yeah I remember looking at the LSB exitcodes standards once before [21:51:44] but even in a mostly-ideal world, the distro packager should be handling that, since it's very system-dependent. [21:52:08] You'll never get Linux + NetBSD + Hurd + WhateverSolarisIsNow to agree on the meaning of those exit codes. [21:52:27] or even 3 different linux distros [21:52:33] sometimes nix means the freedom to make a new standard [21:52:41] no one is following the standard....we shall make a new standard [21:53:01] you know [21:53:05] you could set hasreload=false [21:53:12] and then it would truly break stuff and be noticed :) [21:53:29] :) [21:53:56] luckily failed varnish reloads leave the old state in place. but at this point I'm not sure how far behind some of these classes of caches are. they could be missing important vcl updates :/ [21:54:51] I suppose then /usr/share/varnish/reload-vcl with exec and an unless statement is the best we can do [21:55:15] I haven't been getting Gerrit commit email for the past few hours (list mediawiki-commits). Known issue? [21:55:28] Missing at least 100 or so mails. [21:56:41] well there's two basic hack paths there: (1) You could wrap reload-vcl with an outer script that creates the "heyifailed.state" file if non-zero exit status and removes it on zero exit status, and have a second copy of the exec command (aside from the one that subscribes to changes) that's onlyif on that statefile [21:57:54] well and my (2) idea gets more ridiculously complicated every time I try to type it out, so it probably sucks [21:58:15] what's the issue? 
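Note on hack path (1) above: the outer wrapper could look roughly like the sketch below. It is an illustration under assumptions, not what was deployed. The function name run_with_statefile and the statefile location are hypothetical; only the idea (create the statefile on a non-zero exit, remove it on a zero exit) comes from the chat.

```python
import subprocess
from pathlib import Path


def run_with_statefile(command, statefile):
    """Run command; leave statefile behind on failure, remove it on success.

    Hypothetical helper: a second exec guarded by the presence of
    statefile could then retry the command on later runs.
    """
    statefile = Path(statefile)
    result = subprocess.run(command)
    if result.returncode == 0:
        # Success: clear any marker left by an earlier failed attempt.
        if statefile.exists():
            statefile.unlink()
    else:
        # Failure: record that a retry is still owed.
        statefile.touch()
    return result.returncode
```

A call such as run_with_statefile(["/usr/share/varnish/reload-vcl"], "/var/run/heyifailed.state") would wrap the reload command mentioned in the chat; the /var/run path is an assumption.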
[21:58:54] that vcl reload failures are silently persistent after one failed puppet run [21:59:13] (until the next time someone edits vcl for that host, if it gets noticed then) [22:02:07] if it always should succeed could you just make a plain jane exec no unless or anything, if it fails it borks the run and if it succeeds no problem? I have used unless and cover and then a basic /bin/false in the command field to prevent it from showing constant churn on good runs [22:02:17] but IMHO the problem is more-generic than that, it's a puppet design-fail. The generic version of the problem is that if you changed e.g. rsyslogd.conf in puppet and had that trigger "service rsyslogd reload", and that command fails (exit(1)), you get a failed puppet run, but then no indicator that rsyslogd's config is out of date from there forward. [22:02:21] anyways, last of my meddling [22:03:26] chasemp: I agree that's an easy fix as well, just unconditional reloads. But if you applied that in the general case it's a lot of needless churn on the host. [22:04:13] (and in the varnish case in particular, varnish actually stores every compiled VCL in a list in memory so that the old ones can expire as children die, etc. It's probably not a good idea to reload vcl constantly when there's no real change) [22:04:40] andrewbogott: now i am [22:04:44] ah there isn't a way to check for needed reload but no actually do it? [22:04:49] YuviPanda|zzz: ok [22:05:09] chasemp: not that I'm aware of yet. 
the way we know we need a reload is "puppet changed the VCL files" [22:06:15] (03CR) 10Dzahn: [C: 032] Grant bearND ability to upload mobile releases [operations/puppet] - 10https://gerrit.wikimedia.org/r/140646 (owner: 10Yuvipanda) [22:07:18] forgetting "puppetd -tv" ain't easy :p [22:07:44] to me the ideal solution for the exec case would be a new parameter "persistent_retry", which means to puppet "if conditions triggered this exec and it exited non-zero, save some state and keep re-running this command on future puppet runs until it succeeds, even if the triggering conditions are not currently present" [22:08:35] and then something equivalent for service definitions to persistently retry "reload" the same way (does restart need it as well? can 'restart' fail and leave the daemon running?) [22:08:40] I guess so [22:09:18] but then it needs state between runs which would have to be something almost like exec creates :D [22:09:49] yeah I just want that state to be managed internally by puppet, not by me with hacks in our puppet config for every such case. [22:10:19] makes sense [22:11:03] there are probably odd design issues to consider taking into account all the strange use-cases out there [22:11:15] (03PS1) 10Tim Landscheidt: Tools: Install xsltproc [operations/puppet] - 10https://gerrit.wikimedia.org/r/141588 (https://bugzilla.wikimedia.org/66962) [22:11:24] like whether you retry the exec command as it was when it failed, or as it is now (given the command itself could have templated parameters that change over time) [22:12:07] I know puppet does store a copy of the file it changes, like you change foo it backs it up to /var/puppet/foo.fljasdlfja [22:12:20] there should be an option taht says, later if this fails put the original back and try again [22:12:33] you could actually make an exec or a puppet type that does that [22:13:12] bblack: were you up for retrying the VCL change? 
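Note on the "persistent_retry" idea above: the difference from the current behaviour can be modelled in a few lines. This is a toy simulation only. persistent_retry is the parameter proposed in the chat, not an existing Puppet option, and simulate_runs is a hypothetical name.

```python
def simulate_runs(events, persistent_retry):
    """Model which puppet runs attempt the reload command.

    events: one (file_changed, command_succeeds) pair per puppet run.
    Returns a list of booleans, True where the command was attempted.
    """
    pending = False  # True while a triggered command has not yet succeeded
    attempted = []
    for file_changed, command_succeeds in events:
        should_run = file_changed or (persistent_retry and pending)
        attempted.append(should_run)
        if should_run:
            pending = not command_succeeds
    return attempted


# The file changes once; the reload fails twice before it would succeed.
events = [(True, False), (False, False), (False, True), (False, True)]
print(simulate_runs(events, persistent_retry=False))  # [True, False, False, False]
print(simulate_runs(events, persistent_retry=True))   # [True, True, True, False]
```

Without the retry state the command is attempted only on the run where the file changed, which matches the complaint that later runs look clean while the change stays un-reloaded.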
[22:13:57] ori: let's do it sometime later, I need to run out for dinner and pick up my dog, so I won't be around to see what happens [22:14:09] sure, ok [22:17:30] bblack: https://github.com/andytinycat/puppet-clientbucket-restore [22:17:38] maybe put in a refreshonly on the service [22:17:42] that puts back the old file on fail [22:17:51] thereby causing it to replace with and retry next time [22:17:55] and leaving it in a knowngood state [22:17:57] in the meantime [22:18:30] I meant a refreshonly exec post service if the reload fails, or something to that affect [22:18:37] effect even [22:19:51] looks like you can backup to the server but I have never used it at all http://docs.puppetlabs.com/references/2.6.7/type.html#filebucket [22:20:25] hmmm interesting [22:20:47] I think putting the old copy back if it can't start with the new one is ideal ? [22:21:00] that's certainly cleaner than the hacky methods, and the nice side-effect is the running VCL and the on-disk VCL match when you're looking at things. [22:22:22] it's a bunch of VCL files that can change that trigger the exec, there would have to be an easy way to specify all of this [22:22:41] anyways, I really gotta run before I'm late to pick up the dog from the weekend. [22:28:00] later man, good luck with it, hit me up if you want to bounce ideas [22:35:34] "I'd like to be able to convert molecules." wondered how software could do that :) [22:36:08] but http://packages.ubuntu.com/precise/science/openbabel "Chemical toolbox utilities (cli)" [22:42:19] (03CR) 10Dzahn: [C: 032] "chemistry tools like "Convert between various chemical file formats", "Calculate the energy for a molecule" etc. for User:Rillke (http://c" [operations/puppet] - 10https://gerrit.wikimedia.org/r/141566 (https://bugzilla.wikimedia.org/66995) (owner: 10Yuvipanda) [22:45:47] chasemp: andrewbogott: known ? 
Duplicate declaration: File[/etc/sudoers] on tools-exec [22:46:09] /etc/puppet/modules/admin/manifests/init.pp:42; cannot redeclare at /etc/puppet/manifests/sudo.pp:37 [22:46:12] not known to me [22:47:19] I don't know where a labs host would be getting admin from in the first place [22:47:24] to conflict I mean [22:47:37] what box is this? [22:47:45] chasemp: could be RT #7732 ? [22:47:52] chasemp: tools-exec-03 [22:47:56] tool labs [22:48:16] labstore hosts had admin commented out for now [22:48:20] so it shouldn't be raising errors atm [22:48:51] ah [22:49:01] could this host be catching the default node def? [22:49:10] recently added admin to it so we would get users on new installs [22:49:32] interaction between site.pp and labs to me is slightly opaque [22:50:44] hmm, it should be different from labstore though [22:51:10] these are actually virtual [22:51:31] when you run puppet on one of the tool hosts does it match the default node def in site.pp? [22:52:02] i think it doesn't look at site.pp but just at the config within wikitech [22:52:09] let me confirm that..brb [22:52:11] that was my thought [22:52:20] but that seems impossible if it's picking up the admin module [22:52:30] or at least I have no idea how else it would be seeing that conflict [22:54:57] mutante: I will look [22:55:11] …unless you've already sorted it? [22:55:34] ah, back.. [22:55:42] so i just see this one class configured on the instance [22:55:44] role::labs::tools::execnode [22:55:53] andrewbogott: hi, no i have not [22:55:59] yeah, that sounds right.
There are a few default things that are defined in ldap though [22:56:00] how can that even see "admin" [22:56:04] is the current question [22:57:46] mutante: in ldap I see role::labs::instance, role::labs::tools::execnode, sudo::labs_project [22:57:47] for that instance [22:58:10] so that's where the first definition comes from...but how about the second [22:58:40] mutante: is it reasonable to remove the default node include admin and just check that way? [22:58:47] but if that fixes this...that seems not great [22:58:54] andrewbogott: that doesn't seem to match what is checked when going to "configure instance" [22:59:13] all instances have a couple of classes added on creation [22:59:22] ah, ok [22:59:39] is File[/etc/sudoers] is already declared in file… what you're seeing? [22:59:43] (That's what tools-exec-02 says) [22:59:48] andrewbogott: yes [22:59:50] andrewbogott: yep [23:00:00] o [23:00:01] ok [23:00:04] mwalker, ori, MaxSem: The time is nigh to deploy SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20140623T2300) [23:00:07] i'll do the SWAT [23:00:14] thanks max [23:01:28] chasemp: where is the default again? 
looks [23:01:32] bottom [23:01:37] (CR) MaxSem: [C: 2] Reduce MediaViewer EventLogging rate [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140897 (owner: Gilles) [23:01:40] literally "default:" I think [23:01:41] in site.pp [23:01:53] (Merged) jenkins-bot: Reduce MediaViewer EventLogging rate [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/140897 (owner: Gilles) [23:02:32] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 7.14% of data exceeded the critical threshold [500.0] [23:03:48] !log maxsem Synchronized wmf-config/InitialiseSettings.php: https://gerrit.wikimedia.org/r/#/c/140897/ (duration: 00m 04s) [23:03:53] Logged the message, Master [23:03:58] chasemp: i just see default including "standard" [23:03:58] tgr, ^^^ [23:04:03] please verify:) [23:04:39] node default { [23:04:39] include standard [23:04:39] include admin [23:04:41] } [23:04:43] pull? [23:04:50] yes :p [23:05:00] i see, so that is new [23:05:45] hmm, hmm, and all labs instances are getting this because they are default? [23:06:12] that is my theory :D [23:06:14] andrewbogott: doesn't it completely ignore site.pp ? [23:06:21] no [23:06:26] which if that's true is not cool in case something we wanted not in labs ever was put in standard [23:06:29] I think that's likely the problem. [23:06:44] So that default should have a big if $::realm around it [23:06:55] hrmm.. so standard needs an ugly "if $realm" check ? [23:06:56] MaxSem: works, thanks! [23:07:06] why not put it in site.pp in this one case? [23:07:10] and put a comment in about labs pickup [23:07:12] idk [23:07:18] <_joe_> +1 chasemp [23:07:19] seems like the right place considering the risk [23:07:23] yea, +1 [23:07:32] I'd also like it in site.pp, presuming realm is defined by then [23:07:33] <_joe_> site.pp if $::realm [23:07:56] <_joe_> andrewbogott: ::realm is, by def [23:07:56] want me to write a patch and run a test?
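A minimal sketch of the guard being proposed here, assuming `$::realm` evaluates to 'production' on production hosts and to some other value (such as 'labs') on Labs instances. The exact values and the contents of the eventual patch are not shown in this log, so treat this as an illustration rather than the change that was merged.

```puppet
node default {
    # Labs instances also fall through to this default node definition,
    # so only apply these classes in the production realm.
    # Assumption: 'production' is the realm value used for production hosts.
    if $::realm == 'production' {
        include standard
        include admin
    }
}
```

With a guard like this, a Labs instance that matches the default node gets neither class from site.pp, so anything Labs still needs from `standard` has to be included some other way.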
[23:08:10] i seriously thought a labs instance doesn't ever care about site.pp in any way [23:08:20] <_joe_> mutante: no what labs does is [23:08:44] <_joe_> it defines variables and includes roles from ldap at the TOP scope [23:08:53] <_joe_> then, goes on to apply the puppet tree [23:08:58] andrewbogott: yes please [23:09:02] <_joe_> via site.pp, as any puppet run [23:09:17] <_joe_> so standard IS in labs, and should be [23:09:17] so in theory we could play a trick with a 'node' match before the prod stuff and prevent it from being able to match lower [23:09:23] I mean in theory if you named a host in labs a prod host name [23:09:26] do you get prod host things? [23:09:26] _joe_: thanks! [23:09:41] assuming yes with this matching scheme [23:10:03] and since the master does private data fill in [23:10:05] that seems bad [23:10:07] chasemp: ugh, scary [23:10:10] <_joe_> I need to go to bed, tomorrow I wake up early and It's 1 AM [23:10:13] <_joe_> chasemp: no. [23:10:27] <_joe_> labs puppetmaster has private data from labs/private [23:10:30] <_joe_> ;) [23:10:34] okay so it would match [23:10:38] <_joe_> that's why we can use standard [23:10:41] but it wouldn't pull the wrong private data [23:10:59] <_joe_> exactly [23:11:44] <_joe_> so just either don't include admin in default node for labs, if that's an issue, or create ifs in the admin class [23:12:02] !log maxsem Synchronized php-1.24wmf10/extensions/MobileFrontend/: https://gerrit.wikimedia.org/r/#/c/141102/ (duration: 00m 05s) [23:12:07] Logged the message, Master [23:12:11] I guess for this one I would still vote if in site.pp with a note for anyone who uses the default node def [23:12:16] <_joe_> (or, create a factory of the FileAdmins and LdapAdmins classes, but that's getting fancy and javists) [23:12:17] !log maxsem Synchronized php-1.24wmf9/extensions/MobileFrontend/: https://gerrit.wikimedia.org/r/#/c/141102/ (duration: 00m 06s) [23:12:23] Logged the message, Master [23:12:48] (CR) Dzahn: [C: 1]
Tools: Install xsltproc [operations/puppet] - https://gerrit.wikimedia.org/r/141588 (https://bugzilla.wikimedia.org/66962) (owner: Tim Landscheidt) [23:13:07] <_joe_> ok bye [23:13:15] _joe_: good night [23:13:25] dang, I have to build a new test box, this will take a few minutes [23:13:48] andrewbogott: are you going to put something to that effect up? You probably know labs best / can best comment [23:14:04] chasemp: yes, I'm working on it, just need a test instance [23:14:08] you would be the person asked if next time too :D [23:14:10] cool [23:14:32] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% data above the threshold [250.0] [23:14:32] well sorry guys I didn't realize it would match the default in labs for site.pp [23:16:31] btw, did some monitoring catch that one? [23:16:44] because we were just talking about freshness check and that one is an instance [23:16:51] but it's more important than any instance [23:17:00] it would be labs host...so no right? [23:17:14] not sure, there is also labs icinga [23:17:27] there used to be labs nagios but the freshness thing has been an issue [23:17:48] Hah, but I can't build a testbox because labs is broken, woo [23:17:54] so, time for a blind commit :) [23:18:21] oh, chicken/egg problem? [23:19:54] (PS1) Andrew Bogott: Apply site.pp defaults for production only. [operations/puppet] - https://gerrit.wikimedia.org/r/141598 [23:20:02] chasemp: ^ [23:20:51] (CR) Rush: [C: 1] "cool, I would even throw a comment in there about this is necessary. It's not intuitive to me :) but I get it now. thanks!" [operations/puppet] - https://gerrit.wikimedia.org/r/141598 (owner: Andrew Bogott) [23:20:53] the "standard" part didn't seem to bother it before, but ...
[23:21:21] technically any resource created in a labs manifest that shadows a prod one will cause a dupe error [23:21:27] so good one to have in the back of my mind I guess [23:21:31] in standard I mean even [23:21:34] (CR) Dzahn: [C: 1] "yea, what he says, i also didn't expect labs instances to get default at all" [operations/puppet] - https://gerrit.wikimedia.org/r/141598 (owner: Andrew Bogott) [23:24:05] (PS2) Andrew Bogott: Apply site.pp defaults for production only. [operations/puppet] - https://gerrit.wikimedia.org/r/141598 [23:25:14] thanks andrewbogott [23:26:24] (CR) Andrew Bogott: [C: 2] Apply site.pp defaults for production only. [operations/puppet] - https://gerrit.wikimedia.org/r/141598 (owner: Andrew Bogott) [23:28:05] (CR) Tim Landscheidt: "This feels a bit excessive, as it kicks out ganglia, ntp::client & Co. as well." [operations/puppet] - https://gerrit.wikimedia.org/r/141598 (owner: Andrew Bogott) [23:28:46] (PS1) Awight: remove unused configuration [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/141601 [23:29:36] ah, chasemp, perhaps I should only have excluded admins? It looks like 'standard' was applied on labs previously [23:29:40] (CR) Dzahn: "he's right i guess, the "standard" part didn't cause a problem before, just the new addition of "admin"" [operations/puppet] - https://gerrit.wikimedia.org/r/141598 (owner: Andrew Bogott) [23:30:01] andrewbogott: yeah seems so [23:30:11] andrewbogott: that, or add "standard" to the place in LDAP ? [23:30:20] ^that [23:30:21] the place where labs instances get their own default from [23:30:22] seems most sane? [23:30:32] scfc_de: ? [23:30:34] hm… [23:30:40] if you look at applied in ldap it seems weird it picks this one thing up outside of it [23:30:54] yeah, true. Ok. [23:31:47] (PS1) Andrew Bogott: include standard in the labs role.
[operations/puppet] - https://gerrit.wikimedia.org/r/141602 [23:32:04] mutante: I don't have a strong opinion where standard & Co. should come from. They just need to be applied on (most) instances in Labs :-) (I think standard => base => base::puppet even cares for Puppet to be installed). [23:32:35] scfc_de: you're totally right -- I misread the git history and thought that whole default block was new. Didn't realize it was just the one line that changed. [23:32:38] Anyway… ^^ [23:32:42] (CR) Rush: [C: 1] "consistency for the win!" [operations/puppet] - https://gerrit.wikimedia.org/r/141602 (owner: Andrew Bogott) [23:32:55] (CR) Dzahn: [C: 1] include standard in the labs role. [operations/puppet] - https://gerrit.wikimedia.org/r/141602 (owner: Andrew Bogott) [23:34:33] (CR) Andrew Bogott: [C: 2] include standard in the labs role. [operations/puppet] - https://gerrit.wikimedia.org/r/141602 (owner: Andrew Bogott) [23:34:58] okay so default has a prod wrapper and all labs stuff comes from ldap [23:39:10] (CR) Dzahn: [C: 2] rancid: also create system group for system user [operations/puppet] - https://gerrit.wikimedia.org/r/141078 (owner: Dzahn) [23:43:33] AaronSchulz: the job queue length looks saner since you added the automatic restart job [23:49:42] (CR) Dzahn: [C: 2] pmacct - also create group for systemuser [operations/puppet] - https://gerrit.wikimedia.org/r/141074 (owner: Dzahn) [23:51:29] anyone in Ops have a moment to help to troubleshoot a DNS issue with corp? [23:51:52] dig @ns1.wikimedia.org ns1.corp.wikimedia.org [23:52:39] I'm expecting 198.73.209.11, but keep seeing 216.38.130.189 [23:52:56] is it possible that ns1.wikimedia.org has A records for these ? [23:53:06] I couldn't find anything in puppet [23:53:15] opening an RT [23:53:25] <^demon|away> dns isn't in puppet. [23:53:29] <^demon|away> it's in operations/dns repo [23:53:33] aha [23:53:40] hidden or public? [23:53:42] <^demon|away> public.
[23:53:44] public [23:53:55] off to peek at that [23:54:02] cajoel: yes,.. and yes [23:54:05] i can confirm that [23:54:14] ok, RT ticket to change entries? [23:54:38] or checkout and submit gerrit, etc. [23:54:47] pick which you prefer [23:55:10] which do you think will go faster ?? :) [23:55:17] the latter if you know how to do it [23:55:26] since it means someone just reviews it rather than has to do it themselves [23:55:34] 703 ; Corp glue records [23:55:36] (but you can file rt instead of doing it yourself ;) [23:56:20] easy to do it in gerrit [23:56:31] found it, thx mutante.. [23:56:45] looks like {serial} is templated, so that's nice [23:56:54] any other gotchas? [23:57:07] <^demon|away> mutante, RobH: Either of you mind looking at https://gerrit.wikimedia.org/r/#/c/140665/? It's a pretty easy code style cleanup, no real changes. [23:57:21] <^demon|away> Only on my dashboard because lucene. [23:58:08] ^demon|away: yes