[00:00:04] RoanKattouw, ^d, marktraceur: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150108T0000). Please do the needful. [00:04:48] 3operations: Complete the use of HHVM over Zend PHP on the Wikimedia cluster - https://phabricator.wikimedia.org/T86081#961249 (10chasemp) p:5Triage>3High [00:05:21] 3operations: Complete the use of HHVM over Zend PHP on the Wikimedia cluster - https://phabricator.wikimedia.org/T86081#961194 (10chasemp) @joe or @ori can you weigh in here? [00:11:40] 3operations: Complete the use of HHVM over Zend PHP on the Wikimedia cluster - https://phabricator.wikimedia.org/T86081#961274 (10Reedy) >>! In T86081#961207, @Jdforrester-WMF wrote: >>>! In T75901#961097, @Legoktm wrote: >> We're not at 100% HHVM yet. imagescalers, job runners, and other servers like terbium an... [00:13:43] (03PS1) 10RobH: granting cluster deployment access to Andrew Green [puppet] - 10https://gerrit.wikimedia.org/r/183403 [00:14:43] ^demon|away, RoanKattouw_away or marktraceur, are you up to swat? I've added a couple changes [00:16:00] (03PS2) 10RobH: granting cluster deployment access to Andrew Green [puppet] - 10https://gerrit.wikimedia.org/r/183403 [00:22:20] meh, i'll just deploy it myself [00:24:32] Sorry :( [00:24:36] I'm cooking [00:24:38] Otherwise I would [00:24:42] :P [00:25:17] also, NERD COOKING ALERT! [00:29:21] 3operations: Switch HAT appservers to trusty's ICU - https://phabricator.wikimedia.org/T86096#961386 (10faidon) 3NEW [00:32:38] bd808, logstash is dead [00:32:52] blerg [00:33:20] whoa, really dead [00:34:51] !log restarted logstash on logstash1001 [00:34:58] Logged the message, Master [00:37:10] !log maxsem Synchronized php-1.25wmf14/extensions/WikiGrok/: https://gerrit.wikimedia.org/r/#/c/183186/ (duration: 00m 08s) [00:37:14] Logged the message, Master [00:37:57] !log maxsem Synchronized php-1.25wmf13/extensions/WikiGrok/: https://gerrit.wikimedia.org/r/#/c/183186/ (duration: 00m 07s) [00:38:01] Logged the message, Master [00:38:55] !log elasticsearch cluster for logstash is split brain. 
[00:39:01] Logged the message, Master [00:39:35] zomgfail [00:40:01] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 46 threshold =0.1% breach: {ustatus: uyellow, unumber_of_nodes: 3, uunassigned_shards: 44, utimed_out: False, uactive_primary_shards: 68, ucluster_name: uproduction-logstash-eqiad, urelocating_shards: 0, uactive_shards: 158, uinitializing_shards: 2, unumber_of_data_nodes: 3} [00:40:09] !log restarted elasticsearch on logstash1002 to heal split brain in cluster [00:40:13] Logged the message, Master [00:44:29] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 13, timed_out: False, active_primary_shards: 67, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 186, initializing_shards: 2, number_of_data_nodes: 3 [00:44:50] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 13, timed_out: False, active_primary_shards: 67, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 186, initializing_shards: 2, number_of_data_nodes: 3 [00:45:09] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 3, unassigned_shards: 13, timed_out: False, active_primary_shards: 67, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 186, initializing_shards: 2, number_of_data_nodes: 3 [00:45:59] PROBLEM - HHVM rendering on mw1226 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50722 bytes in 0.007 second response time [00:46:20] PROBLEM - Apache HTTP on mw1226 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 50722 bytes in 0.020 second response time [00:53:46] something went nuts in the logstash date parsing code [00:54:13] there were 8 indicies dated from Jan 2014 that I just deleted [00:54:47] and now I see a whole set dated December 2015 [00:56:33] All the logs with weird dates were sent in via syslog [00:57:25] *shrug* I'll clean them up [01:00:21] frack [01:01:36] !log accidentally deleted 2015-01-07 logstash index when cleaning up rogue indices for 2014-01-* [01:01:40] Logged the message, Master [01:02:00] yesterday didn't happen. nothing to see here folks [01:04:24] !log cleaned up logstash indices dated 2014-01-* and 2015-12-* that look to have been created by some sort of syslog input parsing bug [01:04:29] Logged the message, Master [01:17:23] 3Services, MediaWiki-General-or-Unknown, operations, Wikidata, wikidata-query-service: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#961622 (10bd808) > can support large delays (order of days) for individual consumers Do you have a strong use case to support this need? Kafka... [01:27:35] 3Services, MediaWiki-General-or-Unknown, Analytics, operations, Wikidata, wikidata-query-service: Reliable publish / subscribe event bus - https://phabricator.wikimedia.org/T84923#961647 (10GWicke) >>! In T84923#961622, @bd808 wrote: >> can support large delays (order of days) for individual consumers > > Do yo... 
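The index cleanup bd808 describes above runs against Elasticsearch's HTTP API. A minimal sketch, assuming the stock logstash-YYYY.MM.DD index naming and that the API answers on logstash1001:9200 (both assumptions, not confirmed from the log):

```
# look before deleting: cluster state and the full index list, sorted by name
curl -s 'http://logstash1001:9200/_cluster/health?pretty'
curl -s 'http://logstash1001:9200/_cat/indices?v' | sort

# remove only the rogue, mis-dated indices; an overly broad pattern here is
# exactly how the 2015-01-07 index got deleted by accident at 01:01
curl -s -XDELETE 'http://logstash1001:9200/logstash-2014.01.*'
curl -s -XDELETE 'http://logstash1001:9200/logstash-2015.12.*'
```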
[01:36:44] 3Scrum-of-Scrums, RESTBase, Services, operations: Restbase deployment - https://phabricator.wikimedia.org/T1228#961655 (10GWicke) [01:52:57] (03CR) 10BryanDavis: "https://gerrit.wikimedia.org/r/#/c/181346/ is on the group0 wikis now so this can go out." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181350 (https://phabricator.wikimedia.org/T85067) (owner: 10BryanDavis) [02:10:40] (03PS2) 10Springle: assign a round of codfw slaves to shards [puppet] - 10https://gerrit.wikimedia.org/r/183258 [02:12:29] (03CR) 10Springle: [C: 032] assign a round of codfw slaves to shards [puppet] - 10https://gerrit.wikimedia.org/r/183258 (owner: 10Springle) [02:46:51] PROBLEM - puppet last run on mw1117 is CRITICAL: CRITICAL: Puppet has 1 failures [02:50:09] (03PS1) 10Faidon Liambotis: Remove zinc from dsh [puppet] - 10https://gerrit.wikimedia.org/r/183414 [02:50:11] (03PS1) 10Faidon Liambotis: Remove all references to solr [puppet] - 10https://gerrit.wikimedia.org/r/183415 [02:50:13] (03PS1) 10Faidon Liambotis: icinga: disable embedded perl [puppet] - 10https://gerrit.wikimedia.org/r/183416 [02:50:15] (03PS1) 10Faidon Liambotis: icinga: remove -epn from check_ssl & check_sslxNN [puppet] - 10https://gerrit.wikimedia.org/r/183417 [02:50:17] (03PS1) 10Faidon Liambotis: base: update base::sysctl for trusty's new keys [puppet] - 10https://gerrit.wikimedia.org/r/183418 [02:50:19] (03PS1) 10Faidon Liambotis: install-server::ubuntu-mirror -> mirrors::ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/183419 [02:50:21] (03PS1) 10Faidon Liambotis: mirrors: introduce mirrors::debian and use it [puppet] - 10https://gerrit.wikimedia.org/r/183420 [02:50:23] (03PS1) 10Faidon Liambotis: autoinstall: switch to using our own Debian mirror [puppet] - 10https://gerrit.wikimedia.org/r/183421 [02:50:25] (03PS1) 10Faidon Liambotis: apt: use our own Debian mirror instead of proxying [puppet] - 10https://gerrit.wikimedia.org/r/183422 [02:50:27] (03PS1) 10Faidon Liambotis: labs_bootstrapvz: use our own Debian mirror [puppet] - 10https://gerrit.wikimedia.org/r/183423 [02:50:34] (03PS1) 10Faidon Liambotis: Switch ubuntu.wikimedia.org to a CNAME [dns] - 10https://gerrit.wikimedia.org/r/183424 [02:50:36] (03PS1) 10Faidon Liambotis: Add mirrors.wikimedia.org service alias [dns] - 10https://gerrit.wikimedia.org/r/183425 [02:51:20] ...and now I'm gone :) [02:51:42] (03CR) 10jenkins-bot: [V: 04-1] mirrors: introduce mirrors::debian and use it [puppet] - 10https://gerrit.wikimedia.org/r/183420 (owner: 10Faidon Liambotis) [02:52:38] (03CR) 10jenkins-bot: [V: 04-1] install-server::ubuntu-mirror -> mirrors::ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/183419 (owner: 10Faidon Liambotis) [02:53:58] (03PS2) 10Faidon Liambotis: autoinstall: switch to using our own Debian mirror [puppet] - 10https://gerrit.wikimedia.org/r/183421 [02:54:00] (03PS2) 10Faidon Liambotis: mirrors: introduce mirrors::debian and use it [puppet] - 10https://gerrit.wikimedia.org/r/183420 [02:54:02] (03PS2) 10Faidon Liambotis: labs_bootstrapvz: use our own Debian mirror [puppet] - 10https://gerrit.wikimedia.org/r/183423 [02:54:04] (03PS2) 10Faidon Liambotis: apt: use our own Debian mirror instead of proxying [puppet] - 10https://gerrit.wikimedia.org/r/183422 [03:04:41] RECOVERY - puppet last run on mw1117 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [03:06:05] !log xtrabackup clone db1037 to db1050 [03:06:14] Logged the message, Master [03:16:36] (03PS1) 10Springle: deploy db1050 in s6 [puppet] - 
10https://gerrit.wikimedia.org/r/183427 [03:16:38] (03PS1) 10Springle: sideline db1002 db1003 db1005 db1006 db1009 [puppet] - 10https://gerrit.wikimedia.org/r/183428 [03:17:41] (03CR) 10Springle: [C: 032] deploy db1050 in s6 [puppet] - 10https://gerrit.wikimedia.org/r/183427 (owner: 10Springle) [03:18:18] (03CR) 10Springle: [C: 032] sideline db1002 db1003 db1005 db1006 db1009 [puppet] - 10https://gerrit.wikimedia.org/r/183428 (owner: 10Springle) [04:21:43] (03PS1) 10KartikMistry: WIP: Accept requests from the given domains [puppet] - 10https://gerrit.wikimedia.org/r/183429 [05:38:35] PROBLEM - nutcracker port on mw1225 is CRITICAL: Connection refused [05:44:06] (03PS1) 10Ori.livneh: Have nutcracker listen on a UNIX domain socket on mw1230 and mw1231 [puppet] - 10https://gerrit.wikimedia.org/r/183436 [05:44:28] ^ TimStarling, _joe_ [05:45:45] RECOVERY - nutcracker port on mw1225 is OK: TCP OK - 0.000 second response time on port 11212 [05:46:17] that's a good reminder, need to disable the alert for those hosts too [05:47:24] right... [05:48:42] you could make it listen on both TCP and UNIX if you added another pool [05:49:02] then you wouldn't need simultaneous deployment of the client change and the server change [05:49:15] PROBLEM - puppet last run on db2029 is CRITICAL: CRITICAL: puppet fail [05:49:59] (03PS2) 10Ori.livneh: Have nutcracker listen on a UNIX domain socket on mw1230 and mw1231 [puppet] - 10https://gerrit.wikimedia.org/r/183436 [05:50:42] I can show you what I mean with a patchset if you like [05:50:59] doing the puppet change, then depooling, letting connections drain, and then doing the wmf-config change seems just as easy [05:51:06] but i don't mind either way [05:51:46] well, depool first, then puppet + wmf-config change [05:52:14] actually, your way is better, because then we don't need to touch the monitoring [05:53:03] but i wouldn't want to add the second pool to all hosts, because that would still be a cluster-wide nutcracker restart [05:54:19] yeah, i think i'm back to preferring the way PS2 does it [05:54:21] but your call [05:55:14] I don't understand how you would deploy that to all servers [05:55:42] oh, i see what you're saying. you're looking ahead to actually rolling this out to the rest of the servers. [05:55:55] this is a test, right? [05:56:33] even with two servers, after you commit the puppet change, you would have to shut down apache on those two servers until the puppet run finishes and the client change is deployed [05:57:23] or remove the pid file and allow and the service to spawn a second instance [05:57:32] but that's not very nice [05:57:49] you would also have to change the admin port [05:57:59] fine, fine. ok, if you're still up for updating the patch, go for it. i need to get some water anwyay. 
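Tim's point above is that adding a second nutcracker pool, rather than repointing the existing one, lets the proxy serve both TCP and a UNIX socket at once, so the client-side wmf-config change can land later without a synchronized restart. A rough sketch of such a config (pool names, socket path and the server list are illustrative, not the production values):

```
cat <<'EOF' > /tmp/nutcracker-two-pools.yml
memcached:                      # existing pool, still listening on TCP
  listen: 127.0.0.1:11212
  hash: md5
  distribution: ketama
  servers:
    - 10.64.0.180:11211:1
memcached-unix:                 # new pool on a UNIX domain socket
  listen: /var/run/nutcracker/nutcracker.sock 0666
  hash: md5
  distribution: ketama
  servers:
    - 10.64.0.180:11211:1
EOF
nutcracker --test-conf -c /tmp/nutcracker-two-pools.yml   # syntax-check before deploying
```

Once both pools are live, clients can be flipped to the socket host by host, and the TCP pool (plus its port check) can be dropped afterwards.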
[06:06:03] (03PS3) 10Tim Starling: Have nutcracker listen on a UNIX domain socket on mw1230 and mw1231 [puppet] - 10https://gerrit.wikimedia.org/r/183436 (owner: 10Ori.livneh) [06:08:24] RECOVERY - puppet last run on db2029 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:12:12] (03CR) 10Ori.livneh: [C: 031] Have nutcracker listen on a UNIX domain socket on mw1230 and mw1231 [puppet] - 10https://gerrit.wikimedia.org/r/183436 (owner: 10Ori.livneh) [06:14:24] i wasn't sure if puppet supports hash item assignment but i tested it and it works [06:14:43] the docs can't decide if they're immutable or not [06:15:39] i'd still depool them, to be honest, because it's easy to imagine that we overlooked something, and having alerts go off would suck [06:28:15] PROBLEM - puppet last run on amssq35 is CRITICAL: CRITICAL: Puppet has 2 failures [06:28:35] PROBLEM - puppet last run on cp3016 is CRITICAL: CRITICAL: Puppet has 1 failures [06:28:44] PROBLEM - puppet last run on cp1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:25] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [06:29:25] PROBLEM - puppet last run on db2018 is CRITICAL: CRITICAL: Puppet has 1 failures [06:36:27] <_joe_> ori: hash assignment is deprecated [06:36:51] <_joe_> and I can take over from where you got anyway [06:36:57] <_joe_> (with that change) [06:37:34] works for me [06:37:34] <_joe_> once I'm properly awake anyways [06:37:40] <_joe_> :) [06:45:24] RECOVERY - puppet last run on cp3016 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [06:46:05] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:46:05] RECOVERY - puppet last run on db2018 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [06:46:15] RECOVERY - puppet last run on amssq35 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [06:46:35] RECOVERY - puppet last run on cp1061 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [07:40:04] (03PS2) 10Giuseppe Lavagetto: mediawiki: create "canary" pools to allow testing on subclusters [puppet] - 10https://gerrit.wikimedia.org/r/183226 [07:40:40] <_joe_> ok, to work [07:44:17] (03PS3) 10Giuseppe Lavagetto: mediawiki: create "canary" pools to allow testing on subclusters [puppet] - 10https://gerrit.wikimedia.org/r/183226 [07:52:05] (03PS4) 10Giuseppe Lavagetto: mediawiki: create "canary" pools to allow testing on subclusters [puppet] - 10https://gerrit.wikimedia.org/r/183226 [07:52:21] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: create "canary" pools to allow testing on subclusters [puppet] - 10https://gerrit.wikimedia.org/r/183226 (owner: 10Giuseppe Lavagetto) [08:11:07] <_joe_> !log installing a new hhvm package version on the canary pools [08:11:14] Logged the message, Master [08:11:57] <_joe_> oh, dear. looks like this package is linked to the newer libicu for some reason. 
Depooling the servers, rebuilding the package [08:15:20] <_joe_> !log depooled the canary appservers while a new package version is rebuilt [08:15:25] Logged the message, Master [08:46:31] <_joe_> building packages on labs is as funny as being subjected to a dental extraction [08:48:34] PROBLEM - Varnishkafka Delivery Errors per minute on cp3019 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [30.0] [08:49:35] PROBLEM - Varnishkafka Delivery Errors per minute on cp3021 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [30.0] [08:51:03] (03PS1) 10Giuseppe Lavagetto: api: restore ganglia aggregators [puppet] - 10https://gerrit.wikimedia.org/r/183448 [08:51:25] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] api: restore ganglia aggregators [puppet] - 10https://gerrit.wikimedia.org/r/183448 (owner: 10Giuseppe Lavagetto) [08:53:43] (03PS2) 10Giuseppe Lavagetto: mediawiki: use the worker mpm on the canary clusters [puppet] - 10https://gerrit.wikimedia.org/r/183227 [08:53:52] Can someone in ops list check whether Lego's question got an answer? Cf. https://phabricator.wikimedia.org/T73241#948613 [08:53:53] (03PS2) 10Yuvipanda: shinken: Add ssh checks for all monitored hosts [puppet] - 10https://gerrit.wikimedia.org/r/181807 (https://phabricator.wikimedia.org/T86027) [08:54:09] (03PS3) 10Yuvipanda: shinken: Add ssh checks for all monitored hosts [puppet] - 10https://gerrit.wikimedia.org/r/181807 (https://phabricator.wikimedia.org/T86027) [08:54:39] <_joe_> Nemo_bis: we did for sure [08:54:55] <_joe_> and the number of emails was greatly reduced I guess [08:55:23] <_joe_> (buongiorno) [08:55:25] RECOVERY - Varnishkafka Delivery Errors per minute on cp3021 is OK: OK: Less than 1.00% above the threshold [0.0] [08:55:35] RECOVERY - Varnishkafka Delivery Errors per minute on cp3019 is OK: OK: Less than 1.00% above the threshold [0.0] [08:55:49] giorno :) [08:56:20] (03CR) 10Giuseppe Lavagetto: [C: 032] mediawiki: use the worker mpm on the canary clusters [puppet] - 10https://gerrit.wikimedia.org/r/183227 (owner: 10Giuseppe Lavagetto) [08:56:49] Reduced? I only know about 200k invalid addresses removed, out of some millions to be emailed. If something else was decided, it should be documented in the task description [08:57:10] <_joe_> Nemo_bis: well it was lego's work [08:57:16] <_joe_> I just read the thread [08:57:28] <_joe_> (aren't ops archives public?) [08:58:06] (no) [09:00:10] (there about half a dozen ops-y "security" lists plus an undefined number of security channels) [09:00:38] <_joe_> godog: ping [09:00:38] _joe_: ping detected, please leave a message! [09:00:48] <_joe_> oh, tor is back! [09:01:36] gretings [09:01:38] hey _joe_ [09:02:16] <_joe_> godog: re https://gerrit.wikimedia.org/r/#/c/183256, is the version of uwsgi on tungsten able to support cgroups? 
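The libicu surprise a little earlier is cheap to catch before a rebuilt package reaches even the canary pool. A sketch of the checks (binary path and ICU package names are assumptions; libicu48 is precise's runtime, libicu52 trusty's):

```
ldd /usr/bin/hhvm | grep -i icu          # which ICU runtime the binary actually links against
dpkg-deb -I hhvm_*.deb | grep -i icu     # what the freshly built .deb declares as a dependency
apt-cache policy libicu48 libicu52       # which runtimes the target distribution can install
```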
[09:02:34] it is [09:02:46] (03CR) 10Giuseppe Lavagetto: [C: 031] graphite: limit uwsgi workers memory [puppet] - 10https://gerrit.wikimedia.org/r/183256 (owner: 10Filippo Giunchedi) [09:05:16] (03PS4) 10Giuseppe Lavagetto: Have nutcracker listen on a UNIX domain socket on mw1230 and mw1231 [puppet] - 10https://gerrit.wikimedia.org/r/183436 (owner: 10Ori.livneh) [09:05:23] <_joe_> because that's a no-brainer apart from that [09:05:26] <_joe_> ok [09:06:12] <_joe_> mmh [09:08:18] (03CR) 10Yuvipanda: "https://phabricator.wikimedia.org/T86143 for the blocking issue" [puppet] - 10https://gerrit.wikimedia.org/r/181807 (https://phabricator.wikimedia.org/T86027) (owner: 10Yuvipanda) [09:15:10] 3ops-eqiad, operations: virt1004 - https://phabricator.wikimedia.org/T85798#962179 (10fgiunchedi) 5Open>3Resolved a:3fgiunchedi completed ``` virt1004:~$ cat /proc/mdstat Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] md1 : active raid10 sde2[8] sda2[0] sdh2[7] sdc... [09:33:43] (03PS1) 10Hashar: beta: monitor mobile main page [puppet] - 10https://gerrit.wikimedia.org/r/183454 (https://phabricator.wikimedia.org/T54867) [09:36:41] <_joe_> I'm curious about what kind of tests do we do in beta. We had an hhvm package using and incompatible libicu for ~ 1 month and no test noticed? [09:37:06] (03CR) 10Giuseppe Lavagetto: [C: 032] Have nutcracker listen on a UNIX domain socket on mw1230 and mw1231 [puppet] - 10https://gerrit.wikimedia.org/r/183436 (owner: 10Ori.livneh) [09:40:19] (03PS2) 10Filippo Giunchedi: graphite: limit uwsgi workers memory [puppet] - 10https://gerrit.wikimedia.org/r/183256 [09:40:25] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] graphite: limit uwsgi workers memory [puppet] - 10https://gerrit.wikimedia.org/r/183256 (owner: 10Filippo Giunchedi) [09:42:13] !log restart uwsgi on tungsten [09:42:16] Logged the message, Master [09:50:32] (03PS3) 10Hashar: Graph User::pingLimiter() actions in gdash [puppet] - 10https://gerrit.wikimedia.org/r/166511 (https://bugzilla.wikimedia.org/65478) (owner: 10Nemo bis) [09:51:09] godog: buongiorno! Do you have some spare cycles to apply a gdash change https://gerrit.wikimedia.org/r/#/c/166511/ ? :D [09:51:14] PROBLEM - DPKG on labmon1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages [09:51:41] seems to be on tungsten.eqiad.wmnet [09:51:50] hashar: how easy is it to add a post-merge hook to a repo? It'd be awesome if +2'ing on wikibugs2 auto-pulls the new changes [09:52:05] valhallasw`cloud: auto pull? What do you mean ? [09:52:23] valhallasw`cloud: refreshing the config on the tools labs + reloading the service maybe? [09:52:25] RECOVERY - DPKG on labmon1001 is OK: All packages OK [09:52:36] hashar: basically. service reload is not necessary as config is loaded dynamically [09:53:14] hashar: I can expose 'git pull' over http, so a simple curl should work for it, I think [09:53:18] valhallasw`cloud: we would need some machine on tool labs to be made a Jenkins slave, then craft a job that runs some shell command on that host. [09:53:25] PROBLEM - puppet last run on helium is CRITICAL: CRITICAL: Puppet last ran 14 days ago [09:53:58] valhallasw`cloud: or an API entry point yeah. 
Might work as well :] [09:54:44] RECOVERY - puppet last run on helium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:56:41] 3operations: set git author and committer name when running as root - https://phabricator.wikimedia.org/T86146#962236 (10fgiunchedi) 3NEW [09:59:18] hashar: buongiorno :) did you see my comment on the related ticket? [10:00:04] godog: ouch. There is a problem houston :] [10:00:14] 3operations: set git author and committer name when running as root - https://phabricator.wikimedia.org/T86146#962242 (10valhallasw) For my tool labs projects, I have ``` alias git="HOME=/home/$SUDO_USER git" ``` in .profile, which not only sets the author and email, but also uses the rest of the users' git c... [10:00:20] hashar: https://tools.wmflabs.org/wikibugs/pull.php ;-) [10:00:24] PROBLEM - Varnishkafka Delivery Errors per minute on cp3022 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [30.0] [10:00:51] (03CR) 10Hashar: [C: 04-1] "Per Filippo comment on the task, some link does not seem to work anymore https://phabricator.wikimedia.org/T67478#957152 :-(" [puppet] - 10https://gerrit.wikimedia.org/r/166511 (https://bugzilla.wikimedia.org/65478) (owner: 10Nemo bis) [10:01:07] godog: -1ed the patch, will let Nemo_bis tweak the gdash links. Thank you! [10:01:48] hashar: np :) [10:02:54] !log removing backend hosts from LVS for search pools [10:02:58] Logged the message, Master [10:03:42] I've disabled notifications in icinga for those it should be quiet [10:06:14] PROBLEM - LVS Lucene on search-pool1.svc.eqiad.wmnet is CRITICAL: Connection refused [10:06:24] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection refused [10:06:26] thanks icinga [10:06:34] RECOVERY - Varnishkafka Delivery Errors per minute on cp3022 is OK: OK: Less than 1.00% above the threshold [0.0] [10:06:44] PROBLEM - LVS Lucene on search-pool2.svc.eqiad.wmnet is CRITICAL: Connection refused [10:06:53] sigh sorry about the page [10:07:15] PROBLEM - LVS Lucene on search-pool3.svc.eqiad.wmnet is CRITICAL: Connection refused [10:08:11] ok, false alarm :-) [10:08:18] 3operations, Continuous-Integration: [OPS] Jenkins: Slaves running Ubuntu Trusty should have hhvm installed - https://phabricator.wikimedia.org/T75356#962252 (10hashar) 5Open>3Resolved Patch https://gerrit.wikimedia.org/r/#/c/178806/ is still pending review but otherwise has been already deployed. We have J... [10:08:34] PROBLEM - LVS Lucene on search-pool5.svc.eqiad.wmnet is CRITICAL: Connection refused [10:08:47] akosiaris: yeah sorry about that, silencing properly now [10:10:27] TIL: double check if silencing just the host or the host and all the services [10:10:34] godog: I think you disable notifications on the host not the service [10:10:41] disabled* [10:10:48] but!!! there is also fixed downtime ... 
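The "fixed downtime" akosiaris brings up is scheduled through Icinga's external command file, and doing it for the host and all of its services before depooling is what keeps maintenance like this from paging. A minimal sketch (command-file path, host name and the two-hour window are assumptions):

```
now=$(date +%s); end=$((now + 7200))
cmdfile=/var/lib/icinga/rw/icinga.cmd    # wherever the Icinga command pipe lives
printf '[%s] SCHEDULE_HOST_DOWNTIME;search1001;%s;%s;1;0;7200;godog;lsearchd decom\n' \
    "$now" "$now" "$end" > "$cmdfile"
printf '[%s] SCHEDULE_HOST_SVC_DOWNTIME;search1001;%s;%s;1;0;7200;godog;lsearchd decom\n' \
    "$now" "$now" "$end" > "$cmdfile"
```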
[10:10:55] yeah I added that now [10:10:56] it should NOT have paged us [10:11:10] I've closed the door after the cattle ran out :) [10:11:37] ah, ok then [10:12:07] keep in mind btw that with the load that neon has right now, it might take a while before it processes a command [10:12:26] altough this is a lot better than before thanks to paravoid [10:12:47] indeed neon is in a much better shape [10:13:37] 3operations: Decomission lsearchd - https://phabricator.wikimedia.org/T85009#962270 (10fgiunchedi) all backend hosts commented in pybal config for search_pool1-5 and search_prefix and disabled notification in icinga [10:16:31] ok so if disabling notification for all services on a host in icinga you get the "disable notifications for the host too" but the converse isn't true of course [10:17:15] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 6.67% of data above the critical threshold [500.0] [10:17:31] <_joe_> mh [10:18:02] (03PS6) 10QChris: Add jobs for aggregating hourly projectcount files to daily per wiki csvs [puppet] - 10https://gerrit.wikimedia.org/r/172201 (https://bugzilla.wikimedia.org/72740) [10:19:15] PROBLEM - Varnishkafka Delivery Errors per minute on cp3019 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [30.0] [10:19:43] (03PS7) 10QChris: Add jobs for aggregating hourly projectcount files to daily per wiki csvs [puppet] - 10https://gerrit.wikimedia.org/r/172201 (https://bugzilla.wikimedia.org/72740) [10:22:55] PROBLEM - git.wikimedia.org on antimony is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:24:05] RECOVERY - Varnishkafka Delivery Errors per minute on cp3019 is OK: OK: Less than 1.00% above the threshold [0.0] [10:25:35] PROBLEM - Varnishkafka Delivery Errors per minute on cp3005 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [30.0] [10:27:34] PROBLEM - Varnishkafka Delivery Errors per minute on cp3022 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [30.0] [10:28:44] PROBLEM - Varnishkafka Delivery Errors per minute on cp3021 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [30.0] [10:29:25] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [10:29:55] RECOVERY - git.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 57873 bytes in 0.325 second response time [10:32:45] RECOVERY - Varnishkafka Delivery Errors per minute on cp3005 is OK: OK: Less than 1.00% above the threshold [0.0] [10:33:25] RECOVERY - Varnishkafka Delivery Errors per minute on cp3022 is OK: OK: Less than 1.00% above the threshold [0.0] [10:34:34] RECOVERY - Varnishkafka Delivery Errors per minute on cp3021 is OK: OK: Less than 1.00% above the threshold [0.0] [10:40:14] 3operations: reclaim lsearchd hosts - https://phabricator.wikimedia.org/T86149#962283 (10fgiunchedi) 3NEW a:3fgiunchedi [10:40:50] 3operations: remove lsearchd support from puppet - https://phabricator.wikimedia.org/T86150#962291 (10fgiunchedi) 3NEW a:3fgiunchedi [10:41:20] <_joe_> godog: so we have quite some hosts around, right? [10:41:46] <_joe_> godog: using them for graphite? 
[10:42:45] PROBLEM - DPKG on helium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [10:45:14] RECOVERY - DPKG on helium is OK: All packages OK [10:46:29] (03PS2) 10Filippo Giunchedi: lsearchd: remove lvs configuration [puppet] - 10https://gerrit.wikimedia.org/r/183462 (https://phabricator.wikimedia.org/T85009) [10:46:58] _joe_: no we have graphite hosts on the way, not sure we have a designated use for those yet [10:55:31] <_joe_> !log installing a new hhvm package (with the correct libicu dependence) on canary hosts [10:55:35] Logged the message, Master [11:07:33] <_joe_> !log repooled the canary servers [11:07:39] Logged the message, Master [11:09:56] (03PS1) 10Filippo Giunchedi: lsearchd: remove udp2log configuration [puppet] - 10https://gerrit.wikimedia.org/r/183469 (https://phabricator.wikimedia.org/T85009) [11:23:45] PROBLEM - Varnishkafka Delivery Errors per minute on cp3021 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [30.0] [11:30:22] (03PS1) 10Filippo Giunchedi: gdash: add carbon-cache utilization dashboard [puppet] - 10https://gerrit.wikimedia.org/r/183476 [11:30:45] RECOVERY - Varnishkafka Delivery Errors per minute on cp3021 is OK: OK: Less than 1.00% above the threshold [0.0] [11:32:24] fairly easy ^ if anyone is interested [11:34:36] 3operations: set git author and committer name when running as root - https://phabricator.wikimedia.org/T86146#962374 (10fgiunchedi) that would override other things like ssh keys but indeed we could export GIT_CONFIG instead if SUDO_USER is set and the user has ~/.gitconfig. doing that however requires each use... [11:37:29] !log reboot ms-be1011, xfs hosed :( [11:37:37] Logged the message, Master [11:45:14] PROBLEM - Varnishkafka Delivery Errors per minute on cp3022 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [30.0] [11:48:53] (03PS2) 10Faidon Liambotis: Remove all references to solr [puppet] - 10https://gerrit.wikimedia.org/r/183415 [11:48:55] (03PS2) 10Faidon Liambotis: Remove zinc from dsh [puppet] - 10https://gerrit.wikimedia.org/r/183414 [11:48:57] (03PS2) 10Faidon Liambotis: icinga: remove -epn from check_ssl & check_sslxNN [puppet] - 10https://gerrit.wikimedia.org/r/183417 [11:48:59] (03PS2) 10Faidon Liambotis: icinga: disable embedded perl [puppet] - 10https://gerrit.wikimedia.org/r/183416 [11:49:01] (03PS2) 10Faidon Liambotis: install-server::ubuntu-mirror -> mirrors::ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/183419 [11:49:03] (03PS2) 10Faidon Liambotis: base: update base::sysctl for trusty's new keys [puppet] - 10https://gerrit.wikimedia.org/r/183418 [11:49:05] (03PS3) 10Faidon Liambotis: autoinstall: switch to using our own Debian mirror [puppet] - 10https://gerrit.wikimedia.org/r/183421 [11:49:07] (03PS3) 10Faidon Liambotis: mirrors: introduce mirrors::debian and use it [puppet] - 10https://gerrit.wikimedia.org/r/183420 [11:49:09] (03PS3) 10Faidon Liambotis: labs_bootstrapvz: use our own Debian mirror [puppet] - 10https://gerrit.wikimedia.org/r/183423 [11:49:11] (03PS3) 10Faidon Liambotis: apt: use our own Debian mirror instead of proxying [puppet] - 10https://gerrit.wikimedia.org/r/183422 [11:49:13] (03PS1) 10Faidon Liambotis: nagios_common: add check_ssl's Package dependencies [puppet] - 10https://gerrit.wikimedia.org/r/183478 [11:49:35] anyone feel like reviewing any of these? 
:) [11:50:06] (03CR) 10Faidon Liambotis: [C: 032] Remove zinc from dsh [puppet] - 10https://gerrit.wikimedia.org/r/183414 (owner: 10Faidon Liambotis) [11:50:18] (03CR) 10Faidon Liambotis: [C: 032] Remove all references to solr [puppet] - 10https://gerrit.wikimedia.org/r/183415 (owner: 10Faidon Liambotis) [11:50:52] (03CR) 10jenkins-bot: [V: 04-1] install-server::ubuntu-mirror -> mirrors::ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/183419 (owner: 10Faidon Liambotis) [11:51:59] haha we should introduce carrots and sticks for code reviews [11:54:44] <_joe_> paravoid: in a few, it's the perfect pre-lunch activity [11:55:45] RECOVERY - Varnishkafka Delivery Errors per minute on cp3022 is OK: OK: Less than 1.00% above the threshold [0.0] [11:55:46] does mirrors.wikimedia.org sound ok or should I make it mirror.wikimedia.org? thoughts? [11:55:56] (03CR) 10Alexandros Kosiaris: [C: 032] icinga: remove -epn from check_ssl & check_sslxNN [puppet] - 10https://gerrit.wikimedia.org/r/183417 (owner: 10Faidon Liambotis) [11:55:59] <_joe_> mirror.w.org [11:56:06] <_joe_> imho [11:56:09] (03CR) 10Alexandros Kosiaris: [C: 032] icinga: disable embedded perl [puppet] - 10https://gerrit.wikimedia.org/r/183416 (owner: 10Faidon Liambotis) [11:56:21] yeah singular [11:56:33] I picked mirrors for two reasons [11:56:36] won't it have many mirrors ? ubuntu, debian etc ? [11:56:49] first, I had to name the module "mirrors" to avoid confusion with role::mirror which is another (badly named) thing :/ [11:56:56] that I didn't feel like touching [11:56:58] (grr) [11:57:34] and then to avoid confusion with this being a "wikimedia mirror" rather than "mirrors hosted by wikimedia" [11:57:40] and the second was mirrors.kernel.org [11:58:21] <_joe_> oh ok [11:58:28] but I'm still not sure [11:58:33] of my choice :) [11:58:42] <_joe_> I am used to having the old-fashioned sunsite mirrors [11:58:52] <_joe_> paravoid: it's not /that/ important you know [12:00:01] (03PS3) 10Faidon Liambotis: install-server::ubuntu-mirror -> mirrors::ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/183419 [12:00:03] (03PS2) 10Faidon Liambotis: nagios_common: add check_ssl's Package dependencies [puppet] - 10https://gerrit.wikimedia.org/r/183478 [12:00:05] (03PS3) 10Faidon Liambotis: base: update base::sysctl for trusty's new keys [puppet] - 10https://gerrit.wikimedia.org/r/183418 [12:00:07] (03PS4) 10Faidon Liambotis: autoinstall: switch to using our own Debian mirror [puppet] - 10https://gerrit.wikimedia.org/r/183421 [12:00:08] indeed, both rationale make sense, I think mirror is more common [12:00:09] (03PS4) 10Faidon Liambotis: mirrors: introduce mirrors::debian and use it [puppet] - 10https://gerrit.wikimedia.org/r/183420 [12:00:11] (03PS4) 10Faidon Liambotis: labs_bootstrapvz: use our own Debian mirror [puppet] - 10https://gerrit.wikimedia.org/r/183423 [12:00:13] (03PS4) 10Faidon Liambotis: apt: use our own Debian mirror instead of proxying [puppet] - 10https://gerrit.wikimedia.org/r/183422 [12:01:08] (03CR) 10Faidon Liambotis: [C: 032] nagios_common: add check_ssl's Package dependencies [puppet] - 10https://gerrit.wikimedia.org/r/183478 (owner: 10Faidon Liambotis) [12:01:15] Krinkle: Addressed your points about the composer invocation [12:01:45] (03CR) 10Faidon Liambotis: [C: 032] base: update base::sysctl for trusty's new keys [puppet] - 10https://gerrit.wikimedia.org/r/183418 (owner: 10Faidon Liambotis) [12:05:06] (03CR) 10Faidon Liambotis: [C: 04-1] "Looking at the contents of misc::udp2log::instance, this will leave unmanaged 
files behind :/ If you are aware of it and plan on cleaning " [puppet] - 10https://gerrit.wikimedia.org/r/183469 (https://phabricator.wikimedia.org/T85009) (owner: 10Filippo Giunchedi) [12:06:53] (03CR) 10Faidon Liambotis: [C: 04-1] "What about manifests/role/lvs.pp search_pool references?" [puppet] - 10https://gerrit.wikimedia.org/r/183462 (https://phabricator.wikimedia.org/T85009) (owner: 10Filippo Giunchedi) [12:08:34] http://packages.ubuntu.com/vivid/amd64/libicu-dev/download [12:09:06] 5 "mirrors", 6 "mirror" [12:10:40] (03CR) 10Faidon Liambotis: [C: 032] Switch ubuntu.wikimedia.org to a CNAME [dns] - 10https://gerrit.wikimedia.org/r/183424 (owner: 10Faidon Liambotis) [12:10:49] (03CR) 10Faidon Liambotis: [C: 032] Add mirrors.wikimedia.org service alias [dns] - 10https://gerrit.wikimedia.org/r/183425 (owner: 10Faidon Liambotis) [12:11:25] _joe_: I'll wait for you for the rest [12:12:15] <_joe_> paravoid: I'm starting now :) [12:12:19] (03PS2) 10Giuseppe Lavagetto: Temporarily add Elasticsearch to einsteinium [puppet] - 10https://gerrit.wikimedia.org/r/181612 (owner: 10Manybubbles) [12:14:05] PROBLEM - puppet last run on rdb1002 is CRITICAL: CRITICAL: Puppet has 1 failures [12:15:38] (03CR) 10Giuseppe Lavagetto: [C: 031] install-server::ubuntu-mirror -> mirrors::ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/183419 (owner: 10Faidon Liambotis) [12:20:28] <_joe_> paravoid: archsync can only do blacklists and not whitelists? ewww [12:22:25] RECOVERY - puppet last run on rdb1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [12:24:08] _joe_: yeah... [12:24:14] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM, but I have one question" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/183420 (owner: 10Faidon Liambotis) [12:24:24] <_joe_> sigh [12:24:46] debmirror has the opposite for arches, (only white, not black) but cannot do blacklist for sections [12:25:04] and archvsync is in a horrible git repo, not packaged etc. [12:25:07] <_joe_> a plethora of incomplete tools. I love FLOSS [12:25:07] it's crazy [12:25:22] the irony of archvsync not being packaged is just... [12:25:35] (03CR) 10Giuseppe Lavagetto: [C: 031] autoinstall: switch to using our own Debian mirror [puppet] - 10https://gerrit.wikimedia.org/r/183421 (owner: 10Faidon Liambotis) [12:26:33] (03CR) 10Faidon Liambotis: mirrors: introduce mirrors::debian and use it (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/183420 (owner: 10Faidon Liambotis) [12:28:18] (03CR) 10Giuseppe Lavagetto: mirrors: introduce mirrors::debian and use it (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/183420 (owner: 10Faidon Liambotis) [12:28:36] (03CR) 10Giuseppe Lavagetto: [C: 031] apt: use our own Debian mirror instead of proxying (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/183422 (owner: 10Faidon Liambotis) [12:29:24] <_joe_> I can't really review the bootstrapvz patch as I don't know much about it [12:30:17] yeah that's fine, me neither, that's for andrewb [12:31:08] re: apt, I didn't understand your question [12:31:35] <_joe_> you remove an ensure => absent resource from the ubuntu stanza [12:31:41] not really :) [12:31:51] <_joe_> ah fuck gerrit [12:31:52] I'm moving it out of the if [12:31:56] <_joe_> I hate looking at code there [12:31:57] <_joe_> yes [12:32:05] but we should remove it at some point, sure [12:32:13] <_joe_> ok [12:32:18] <_joe_> go on sorry [12:32:20] <_joe_> :) [12:34:25] <_joe_> does someone know why we still have the ganglia/ganglia_new dichotomy? 
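On the client side, the "use our own Debian mirror" change merged above amounts to pointing APT at the new host. A sketch of what that could look like on a wheezy machine (file name, path under mirrors.wikimedia.org and the suite list are illustrative; the real change is rolled out from puppet):

```
cat <<'EOF' > /etc/apt/sources.list.d/wikimedia-mirror.list
deb http://mirrors.wikimedia.org/debian/ wheezy main contrib non-free
deb http://mirrors.wikimedia.org/debian/ wheezy-updates main contrib non-free
EOF
apt-get update
```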
[12:34:31] <_joe_> I'd like to unify those [12:34:40] because noone has worked on completing the transition [12:34:49] <_joe_> but before I embark in fixing that, I'd like to understand what stands in the way [12:38:26] <_joe_> paravoid: so there is no real technical difference between the two? [12:48:17] <_joe_> well, there are some substantial differences. the "old" ganglia uses multicast addresses, ganglia_new doesn't [12:54:10] yes [12:54:27] that was mark's work [12:54:37] moving us off multicast and into unicast aggregators, iirc [12:59:44] RECOVERY - very high load average likely xfs on ms-be1011 is OK: OK - load average: 50.35, 14.92, 5.16 [13:06:34] PROBLEM - HTTP 5xx req/min on tungsten is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [13:10:05] (03CR) 10Alexandros Kosiaris: [C: 032] install-server::ubuntu-mirror -> mirrors::ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/183419 (owner: 10Faidon Liambotis) [13:15:39] (03CR) 10Alexandros Kosiaris: [C: 031] mirrors: introduce mirrors::debian and use it [puppet] - 10https://gerrit.wikimedia.org/r/183420 (owner: 10Faidon Liambotis) [13:19:08] (03CR) 10Alexandros Kosiaris: [C: 032] autoinstall: switch to using our own Debian mirror [puppet] - 10https://gerrit.wikimedia.org/r/183421 (owner: 10Faidon Liambotis) [13:20:54] RECOVERY - HTTP 5xx req/min on tungsten is OK: OK: Less than 1.00% above the threshold [250.0] [13:22:25] (03CR) 10Faidon Liambotis: [C: 032] mirrors: introduce mirrors::debian and use it [puppet] - 10https://gerrit.wikimedia.org/r/183420 (owner: 10Faidon Liambotis) [13:27:21] (03CR) 10Faidon Liambotis: [C: 032] autoinstall: switch to using our own Debian mirror [puppet] - 10https://gerrit.wikimedia.org/r/183421 (owner: 10Faidon Liambotis) [13:27:34] (03CR) 10Faidon Liambotis: [C: 032] apt: use our own Debian mirror instead of proxying [puppet] - 10https://gerrit.wikimedia.org/r/183422 (owner: 10Faidon Liambotis) [13:30:14] PROBLEM - Slow CirrusSearch query rate on fluorine is CRITICAL: CirrusSearch-slow.log_line_rate CRITICAL: 0.00332225913621 [13:40:20] RECOVERY - Slow CirrusSearch query rate on fluorine is OK: CirrusSearch-slow.log_line_rate OKAY: 0.0 [13:40:30] PROBLEM - Varnishkafka Delivery Errors per minute on cp3019 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [30.0] [13:46:19] RECOVERY - Varnishkafka Delivery Errors per minute on cp3019 is OK: OK: Less than 1.00% above the threshold [0.0] [13:47:29] PROBLEM - Varnishkafka Delivery Errors per minute on cp3004 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [30.0] [13:49:59] PROBLEM - Varnishkafka Delivery Errors per minute on cp3016 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [30.0] [13:53:19] RECOVERY - Varnishkafka Delivery Errors per minute on cp3004 is OK: OK: Less than 1.00% above the threshold [0.0] [13:55:50] RECOVERY - Varnishkafka Delivery Errors per minute on cp3016 is OK: OK: Less than 1.00% above the threshold [0.0] [14:06:35] (03PS1) 10Faidon Liambotis: install-server: kill rspec for the now-gone mirror [puppet] - 10https://gerrit.wikimedia.org/r/183491 [14:06:37] (03PS1) 10Faidon Liambotis: mirrors: don't purge archvsync logs [puppet] - 10https://gerrit.wikimedia.org/r/183492 [14:06:39] (03PS1) 10Faidon Liambotis: mirrors: move to /srv/mirrors; new class ::serve [puppet] - 10https://gerrit.wikimedia.org/r/183493 [14:09:03] (03PS1) 10Alexandros Kosiaris: RSpec install-server fixes [puppet] - 10https://gerrit.wikimedia.org/r/183494 [14:09:05] (03CR) 10Faidon 
Liambotis: [C: 032] install-server: kill rspec for the now-gone mirror [puppet] - 10https://gerrit.wikimedia.org/r/183491 (owner: 10Faidon Liambotis) [14:09:21] (03CR) 10Faidon Liambotis: [C: 032] mirrors: don't purge archvsync logs [puppet] - 10https://gerrit.wikimedia.org/r/183492 (owner: 10Faidon Liambotis) [14:10:56] (03CR) 10Alexandros Kosiaris: "This has been done in a better way in https://gerrit.wikimedia.org/r/183494. I will rebase" [puppet] - 10https://gerrit.wikimedia.org/r/183491 (owner: 10Faidon Liambotis) [14:11:16] (03CR) 10Faidon Liambotis: [C: 032] mirrors: move to /srv/mirrors; new class ::serve [puppet] - 10https://gerrit.wikimedia.org/r/183493 (owner: 10Faidon Liambotis) [14:13:41] (03PS1) 10Faidon Liambotis: mirrors: brown-paper bag fix for ::serve [puppet] - 10https://gerrit.wikimedia.org/r/183495 [14:14:13] (03PS2) 10Faidon Liambotis: mirrors: brown paper bag fix for ::serve [puppet] - 10https://gerrit.wikimedia.org/r/183495 [14:14:27] (03CR) 10Faidon Liambotis: [C: 032 V: 032] mirrors: brown paper bag fix for ::serve [puppet] - 10https://gerrit.wikimedia.org/r/183495 (owner: 10Faidon Liambotis) [14:15:58] http://mirrors.wikimedia.org/ ;) [14:16:09] (03PS2) 10Alexandros Kosiaris: RSpec install-server fixes [puppet] - 10https://gerrit.wikimedia.org/r/183494 [14:16:38] * aude wonders why jouncebot didn't announce... we are deploying stuff to wikidata now [14:18:56] well... test.wikidata [14:19:13] (03PS3) 10QChris: Install maven on analytics clients [puppet] - 10https://gerrit.wikimedia.org/r/170668 [14:19:58] (03CR) 10jenkins-bot: [V: 04-1] Install maven on analytics clients [puppet] - 10https://gerrit.wikimedia.org/r/170668 (owner: 10QChris) [14:20:40] (03PS1) 10Faidon Liambotis: mirrros: fix perms for ::debian's log directory [puppet] - 10https://gerrit.wikimedia.org/r/183499 [14:20:51] (03PS4) 10QChris: Install maven on analytics clients [puppet] - 10https://gerrit.wikimedia.org/r/170668 [14:20:58] (03PS2) 10Faidon Liambotis: mirrors: fix perms for ::debian's log directory [puppet] - 10https://gerrit.wikimedia.org/r/183499 [14:21:08] (03CR) 10Faidon Liambotis: [C: 032] mirrors: fix perms for ::debian's log directory [puppet] - 10https://gerrit.wikimedia.org/r/183499 (owner: 10Faidon Liambotis) [14:21:41] (03CR) 10Faidon Liambotis: [V: 032] mirrors: fix perms for ::debian's log directory [puppet] - 10https://gerrit.wikimedia.org/r/183499 (owner: 10Faidon Liambotis) [14:25:50] PROBLEM - Varnishkafka Delivery Errors per minute on cp3008 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [30.0] [14:26:49] PROBLEM - Varnishkafka Delivery Errors per minute on cp3019 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [30.0] [14:27:05] (03PS3) 10Alexandros Kosiaris: RSpec install-server fixes [puppet] - 10https://gerrit.wikimedia.org/r/183494 [14:27:39] PROBLEM - Varnishkafka Delivery Errors per minute on cp3010 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [30.0] [14:29:20] PROBLEM - Varnishkafka Delivery Errors per minute on cp3020 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [30.0] [14:30:40] PROBLEM - Varnishkafka Delivery Errors per minute on cp3015 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [30.0] [14:33:31] RECOVERY - Varnishkafka Delivery Errors per minute on cp3010 is OK: OK: Less than 1.00% above the threshold [0.0] [14:33:50] RECOVERY - Varnishkafka Delivery Errors per minute on cp3019 is OK: OK: Less than 1.00% above the threshold [0.0] [14:34:09] RECOVERY - 
Varnishkafka Delivery Errors per minute on cp3008 is OK: OK: Less than 1.00% above the threshold [0.0] [14:34:09] PROBLEM - Varnishkafka Delivery Errors on cp3015 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 1260.758301 [14:34:23] !log samarium package updates and reboot [14:34:27] Logged the message, Master [14:35:10] !log aude Started scap: Update group0 to wmf/1.25wmf14 Wikidata extension branch [14:35:13] Logged the message, Master [14:36:06] (03CR) 10Faidon Liambotis: [C: 032] "If you say so :)" [puppet] - 10https://gerrit.wikimedia.org/r/183494 (owner: 10Alexandros Kosiaris) [14:40:19] RECOVERY - Varnishkafka Delivery Errors per minute on cp3020 is OK: OK: Less than 1.00% above the threshold [0.0] [14:41:14] (03PS1) 10Faidon Liambotis: ssh: configure ECDSA & ed25519 host keys [puppet] - 10https://gerrit.wikimedia.org/r/183500 [14:41:29] anyone want to do a *very* careful review of ^? [14:41:29] PROBLEM - Varnishkafka Delivery Errors per minute on cp3022 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [30.0] [14:43:34] (03PS1) 10Aude: Bump cache epoch for test wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183501 [14:43:49] RECOVERY - Varnishkafka Delivery Errors on cp3015 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [14:45:05] aaah forgot a patch that we wanted for scap [14:45:12] not critical though [14:46:00] PROBLEM - Varnishkafka Delivery Errors per minute on cp3019 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [30.0] [14:47:40] RECOVERY - Varnishkafka Delivery Errors per minute on cp3015 is OK: OK: Less than 1.00% above the threshold [0.0] [14:47:50] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [14:47:50] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [14:47:57] (03PS2) 10Faidon Liambotis: ssh: configure ECDSA & ed25519 host keys [puppet] - 10https://gerrit.wikimedia.org/r/183500 [14:48:40] RECOVERY - Varnishkafka Delivery Errors per minute on cp3022 is OK: OK: Less than 1.00% above the threshold [0.0] [14:49:57] chasemp: hey, here? [14:52:10] PROBLEM - Varnishkafka Delivery Errors per minute on cp3020 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [30.0] [14:55:10] PROBLEM - Varnishkafka Delivery Errors on cp3020 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 657.916687 [14:57:30] PROBLEM - Varnishkafka Delivery Errors per minute on cp3021 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [30.0] [14:59:30] PROBLEM - Varnishkafka Delivery Errors per minute on cp3022 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [30.0] [14:59:40] qchris: any idea what's up with that? [15:00:33] paravoid: just about to head into a meeting. [15:00:36] YuviPanda: here [15:00:38] kart_: hey [15:00:40] (03CR) 10Anomie: monolog: honor log sampling and levels for logstash (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181350 (https://phabricator.wikimedia.org/T85067) (owner: 10BryanDavis) [15:00:46] Those errors are not good, but ok. [15:00:50] kart_: as I mentioned, usually these kind of cronjobs run on stat* machines [15:00:59] YuviPanda: it is sql query you seen. 
[15:01:13] so you can just put in ops/puppet, set up a small corn job in misc/statistics.pp [15:01:21] kart_: ottomata or milimetric would be the best people to ask about this. [15:01:30] Thanks. [15:01:42] There'll be some more testing on our end to see what's going on. [15:01:54] !log aude Finished scap: Update group0 to wmf/1.25wmf14 Wikidata extension branch (duration: 26m 43s) [15:01:57] Logged the message, Master [15:02:00] PROBLEM - Varnishkafka Delivery Errors per minute on cp3008 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [30.0] [15:02:39] paravoid: ottomata is at it :-/ [15:03:10] PROBLEM - Varnishkafka Delivery Errors per minute on cp3016 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [30.0] [15:04:15] kart_: I love corn, but I can also help you with any cron jobs you need :) [15:04:49] RECOVERY - Varnishkafka Delivery Errors on cp3020 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [15:04:50] RECOVERY - Varnishkafka Delivery Errors per minute on cp3021 is OK: OK: Less than 1.00% above the threshold [0.0] [15:05:08] hm, i obviously need to chang ethe thresholds [15:05:11] these should all be warnings [15:06:37] (03CR) 10Aude: [C: 032] Bump cache epoch for test wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183501 (owner: 10Aude) [15:06:43] (03Merged) 10jenkins-bot: Bump cache epoch for test wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183501 (owner: 10Aude) [15:06:54] (03PS7) 10BryanDavis: monolog: honor log sampling and levels for logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181350 (https://phabricator.wikimedia.org/T85067) [15:07:58] (03CR) 10BryanDavis: monolog: honor log sampling and levels for logstash (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181350 (https://phabricator.wikimedia.org/T85067) (owner: 10BryanDavis) [15:07:59] RECOVERY - Varnishkafka Delivery Errors per minute on cp3022 is OK: OK: Less than 1.00% above the threshold [0.0] [15:08:32] !log aude Synchronized wmf-config/Wikibase.php: Bump cache epoch for test.wikidata (duration: 00m 06s) [15:08:36] Logged the message, Master [15:08:50] RECOVERY - Varnishkafka Delivery Errors per minute on cp3019 is OK: OK: Less than 1.00% above the threshold [0.0] [15:09:10] RECOVERY - Varnishkafka Delivery Errors per minute on cp3008 is OK: OK: Less than 1.00% above the threshold [0.0] [15:09:19] * aude done (for now) [15:10:12] milimetric: what is the best way to run sql queries using cron? [15:10:19] RECOVERY - Varnishkafka Delivery Errors per minute on cp3016 is OK: OK: Less than 1.00% above the threshold [0.0] [15:10:20] RECOVERY - Varnishkafka Delivery Errors per minute on cp3020 is OK: OK: Less than 1.00% above the threshold [0.0] [15:10:29] adding cron in misc/statistic.pp [15:11:17] 3operations: reclaim lsearchd hosts - https://phabricator.wikimedia.org/T86149#962954 (10chasemp) I expect there will be shame donuts in SF for the 4 a.m. accidental paging :D [15:12:25] hi kart_ [15:12:29] aharoni: hello [15:12:34] chasemp: sigh I wasn't expecting anyone in the US to get paged at 4am [15:12:48] godog: I know it :) no worries, just giving you trouble [15:13:08] I logged onto irc and saw your 'oh crap paging' [15:13:55] hehe two mistakes, one disabling everything at once and the other not double checking host vs service in icinga [15:14:09] ah that old chestnut [15:14:15] all services and...host too! 
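A minimal sketch of the report kart_ is after: count users who enabled the ContentTranslation beta feature on a few wikis and mail the result, the kind of script a cron entry in misc/statistics.pp could call. The property name, database host, wiki list and recipient are all assumptions to verify before trusting the numbers:

```
#!/bin/bash
# crontab sketch:  0 6 * * 1  /usr/local/bin/cx-beta-report
for wiki in eswiki cawiki ptwiki; do
    count=$(mysql -h analytics-store.eqiad.wmnet -BN -e \
        "SELECT COUNT(*) FROM user_properties
          WHERE up_property = 'cx' AND up_value = '1';" "$wiki")
    printf '%s\t%s\n' "$wiki" "$count"
done | mail -s 'ContentTranslation beta feature enrollment' language-team@example.org
```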
[15:14:20] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [15:14:20] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [15:14:33] I've never ever said, please silence every service but let me know if ping fails...? [15:14:43] silly defaults that carry over from long, long ago [15:15:30] 3operations: Switch HAT appservers to trusty's ICU - https://phabricator.wikimedia.org/T86096#962981 (10chasemp) @faidon, is there someone that would be appropriate for assignee or can we reduce to normal priority? [15:16:16] 3operations: Switch HAT appservers to trusty's ICU - https://phabricator.wikimedia.org/T86096#962988 (10faidon) a:3Joe [15:16:34] 3operations, Wikimedia-SSL-related, Wikimedia-Git-or-Gerrit: Chrome warns about insecure certificate on gerrit.wikimedia.org - https://phabricator.wikimedia.org/T76562#962992 (10chasemp) @mark, it sounds like we need to buy a certificate to make this go away, is that something you would approve? [15:16:43] chasemp: SAL says that _joe_ was already working on it today :D [15:16:50] heh nice [15:17:00] I have a weird tick for high priority things with no assignee :) [15:17:09] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 91.198.174.247, interfaces up: 36, down: 1, dormant: 0, excluded: 1, unused: 0BRge-0/0/0: down - Core: msw-oe12-esamsBR [15:17:18] <_joe_> paravoid: :P [15:17:41] <_joe_> chasemp: that's paravoid teasing me [15:17:51] note that I didn't explicitly set the priority [15:17:54] *woosh* on me then [15:18:01] (03PS1) 10Yuvipanda: labs: Remove explicitly set apt timeout [puppet] - 10https://gerrit.wikimedia.org/r/183505 [15:18:03] (03PS1) 10Yuvipanda: network: Add shinken-01 to monitoring_hosts for labs [puppet] - 10https://gerrit.wikimedia.org/r/183506 (https://phabricator.wikimedia.org/T86143) [15:18:03] the parent ticket is p:high [15:18:07] and create subtask copies that over [15:18:08] ahhh [15:18:23] actually I didn't catch that makes sense, the parent would set the tone [15:18:28] _joe_: https://gerrit.wikimedia.org/r/#/c/183500/ [15:18:51] 3operations, Wikimedia-SSL-related, Wikimedia-Git-or-Gerrit: Chrome warns about insecure certificate on gerrit.wikimedia.org - https://phabricator.wikimedia.org/T76562#963006 (10mark) >>! In T76562#961213, @Chad wrote: > We tried putting it behind misc-web-lb but we had problems with the git protocol behaving th... [15:19:09] Coren: want to +1 the two changes aaboovee ^^? [15:19:28] 3operations, MediaWiki-Core-Team, MediaWiki-ResourceLoader: Bad cache stuck due to race condition with scap between different web servers - https://phabricator.wikimedia.org/T47877#963015 (10chasemp) p:5High>3Normal Seems like no meaningful update in a month, no assignee, and no direct actionable that I can... [15:19:35] <_joe_> paravoid: srsly? [15:19:47] srsly... [15:20:02] YuviPanda: Looking [15:20:18] milimetric: also want to run script on few wikis. How to do that via cron? [15:20:43] _joe_: could I assign you https://phabricator.wikimedia.org/T86081 [15:20:49] seems like mostly a container issue really [15:20:58] but I'm not sure how all the hhvm tickets are organized [15:21:01] (03PS2) 10coren: labs: Remove explicitly set apt timeout [puppet] - 10https://gerrit.wikimedia.org/r/183505 (owner: 10Yuvipanda) [15:21:07] _joe_: review would be welcome :) [15:21:17] (03CR) 10coren: [C: 031] "Seems not insane." 
[puppet] - 10https://gerrit.wikimedia.org/r/183505 (owner: 10Yuvipanda) [15:21:20] I'm guessing it's something you'd care about :P [15:21:24] <_joe_> paravoid: yeah give me 5 minutes :) [15:21:28] (03PS2) 10coren: network: Add shinken-01 to monitoring_hosts for labs [puppet] - 10https://gerrit.wikimedia.org/r/183506 (https://phabricator.wikimedia.org/T86143) (owner: 10Yuvipanda) [15:21:58] (03CR) 10coren: [C: 031] "Yep, that's a monitoring host." [puppet] - 10https://gerrit.wikimedia.org/r/183506 (https://phabricator.wikimedia.org/T86143) (owner: 10Yuvipanda) [15:22:09] (03CR) 10Faidon Liambotis: [C: 04-1] "apt-conf.d is not puppet-managed. Removing the stanza does not remove the file. apt::conf does accept $ensure, though." [puppet] - 10https://gerrit.wikimedia.org/r/183505 (owner: 10Yuvipanda) [15:22:35] paravoid: hmm, I don’t particularly care what the time out is, fwiw. [15:22:48] and was mostly removing it for hygiene reasons [15:22:54] but I guess that’ll leave us with inconsistent timeouts [15:23:13] sure but different timeouts between old & new hosts might make things interesting [15:23:19] you can remove it and salt rm it too [15:23:28] it's just a file :) [15:24:16] paravoid: yeah, let me do that instead. [15:24:36] salt runs on all hosts, puppet has been failing on some betalabs hosts for… a long time [15:24:56] 3operations: Complete the use of HHVM over Zend PHP on the Wikimedia cluster - https://phabricator.wikimedia.org/T86081#963054 (10Joe) TMH conversion is a mostly a subset of the imagescalers' issue. I don't see how we need to eradicate zend everywhere including snapshots and deployment servers in order >>! In... [15:25:27] (03CR) 10Yuvipanda: "I'll salt rm the file instead, though :)" [puppet] - 10https://gerrit.wikimedia.org/r/183505 (owner: 10Yuvipanda) [15:26:05] paravoid: remove -1? :) [15:26:35] (03CR) 10Faidon Liambotis: [C: 031] "WFM" [puppet] - 10https://gerrit.wikimedia.org/r/183505 (owner: 10Yuvipanda) [15:26:44] (03CR) 10Faidon Liambotis: network: Add shinken-01 to monitoring_hosts for labs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/183506 (https://phabricator.wikimedia.org/T86143) (owner: 10Yuvipanda) [15:26:46] (03CR) 10Yuvipanda: [C: 032] labs: Remove explicitly set apt timeout [puppet] - 10https://gerrit.wikimedia.org/r/183505 (owner: 10Yuvipanda) [15:26:55] milimetric: in short: 1. we want to run sql query that gets no. of user enabled beta feature (content translation) 2. we want to run it on 8 wikis. 3. get output email. That's all :) [15:26:59] (03CR) 10Yuvipanda: [C: 032] network: Add shinken-01 to monitoring_hosts for labs [puppet] - 10https://gerrit.wikimedia.org/r/183506 (https://phabricator.wikimedia.org/T86143) (owner: 10Yuvipanda) [15:27:07] aharoni: ^ correct me. [15:27:45] 3operations, Wikimedia-SSL-related, Wikimedia-Git-or-Gerrit: Chrome warns about insecure certificate on gerrit.wikimedia.org - https://phabricator.wikimedia.org/T76562#963063 (10Chad) I don't remember to be honest. We could certainly test it. [15:29:18] kart_: and you have access to stat1003? [15:30:50] _joe_: btw, there are a few mw* alerts, e.g. mw1216, "500 internal server error", no SAL entry [15:30:56] er, mw1226, sorry [15:31:01] is that you or something else? 
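For the transient "Cannot assign requested address" nutcracker alerts, the ephemeral-port theory is easy to eyeball on an affected appserver. A rough sketch (port 11212 assumed for the local nutcracker; TIME_WAIT counts are only a heuristic):

```
cat /proc/sys/net/ipv4/ip_local_port_range               # size of the ephemeral range
ss -tan state time-wait | wc -l                          # sockets parked in TIME_WAIT
ss -tan '( sport = :11212 or dport = :11212 )' | wc -l   # churn against the local nutcracker
```

If the counts approach the size of the range, the check itself can fail to bind a source port, which would match the come-and-go pattern; moving clients to the UNIX socket avoids the problem entirely.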
[15:32:42] <_joe_> paravoid: something else [15:32:48] <_joe_> so I'll take a look [15:33:15] I can as well, it's not like every mw error from now on is exclusively your responsibility :) [15:33:29] <_joe_> mw1113 is down since 7 days, meh [15:33:49] it's just that you've been doing a lot of work the last few months and it's easy to lose track of real alerts [15:34:17] <_joe_> paravoid: yes of course, let's say that now most hhvm and apache related alerts are "real" [15:37:06] <_joe_> !log restarting hhvm on mw1113, stuck in parsing the ini file (HPHP::is_valid_var_name) [15:37:09] Logged the message, Master [15:37:16] :( [15:39:47] <_joe_> paravoid: happens sometimes, this is the weirdest deadlock I've seen from HHVM [15:41:31] RECOVERY - Apache HTTP on mw1113 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.052 second response time [15:41:31] RECOVERY - HHVM rendering on mw1113 is OK: HTTP OK: HTTP/1.1 200 OK - 66465 bytes in 0.190 second response time [15:42:39] milimetric: nope [15:43:59] RECOVERY - Apache HTTP on mw1226 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 440 bytes in 0.040 second response time [15:44:00] <_joe_> !log restarting hhvm on mw1226, TC full [15:44:05] Logged the message, Master [15:44:06] tc? [15:44:15] kart_: can we talk in #wikimedia-analytics ? [15:44:19] RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 66453 bytes in 0.570 second response time [15:44:21] <_joe_> translation cache if I'm not wrong [15:45:03] ok [15:45:12] PROBLEM - puppet last run on wtp1020 is CRITICAL: CRITICAL: Puppet last ran 2 days ago [15:45:12] and what's with the Cannot assign requested address for nutcracker alerts? [15:45:22] that come and go for random servers [15:46:21] RECOVERY - puppet last run on wtp1020 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [15:46:29] <_joe_> yes, I think it's a problem with the check, maybe we're exhausting ephemeral ports, but it's too transient to diagnose. I thought it may be related to the other mc alerts we're getting on the more heavily trafficked servers [15:46:32] cscott: ping? [15:46:46] (03PS1) 10Alexandros Kosiaris: Followup commit for 51df8f0 [puppet] - 10https://gerrit.wikimedia.org/r/183512 [15:47:15] <_joe_> paravoid: we're going to use the unix socket very soon, I was actually about to give it a shot on a couple of servers where we added it to the config [15:47:19] marktraceur, ^d, (manybubbles seems to not be here to ping): Who wants to SWAT today? [15:47:34] Hm [15:47:35] ok [15:47:37] I can [15:47:41] marktraceur: ok! [15:47:53] <_joe_> so my plan is, if they don't go away after we switched, I'll dig deeper [15:48:13] <^d> What's on the list other than the AbuseFilter stuff?
[15:48:14] makes sense [15:48:20] <^d> jouncebot_: next [15:48:20] In 0 hour(s) and 11 minute(s): Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150108T1600) [15:48:32] <^d> Ah, ok [15:48:35] <_joe_> this reminds me, I should make the mediawiki-config change [15:49:36] (03CR) 10Alexandros Kosiaris: [C: 032] Followup commit for 51df8f0 [puppet] - 10https://gerrit.wikimedia.org/r/183512 (owner: 10Alexandros Kosiaris) [15:50:20] (03PS1) 10Ottomata: Fix up some /srv /a discrepencies on stat1001 [puppet] - 10https://gerrit.wikimedia.org/r/183513 [15:50:30] (03PS2) 10Ottomata: Fix up some /srv /a discrepencies on stat1001 [puppet] - 10https://gerrit.wikimedia.org/r/183513 [15:50:36] Does anyone know the details behind https://gerrit.wikimedia.org/r/#/c/159850/ ? [15:51:16] it left a (now broken) VE install behind [15:51:21] (03CR) 10Ottomata: [C: 032] Fix up some /srv /a discrepencies on stat1001 [puppet] - 10https://gerrit.wikimedia.org/r/183513 (owner: 10Ottomata) [15:52:01] milimetric: sure [15:52:51] (03CR) 10Nikerabbit: [C: 031] Beta: Fix spacing and indentation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/178772 (owner: 10KartikMistry) [15:53:50] (03CR) 10Alexandros Kosiaris: [C: 032] Add the apertium.svc.eqiad.wmnet DNS record [dns] - 10https://gerrit.wikimedia.org/r/183223 (owner: 10Alexandros Kosiaris) [15:53:57] Krenair: disabling Parsoid at loginwiki won't break anything, right? [15:54:08] (03PS1) 10Alexandros Kosiaris: Normalize checkcommands.cfg whitespace [puppet] - 10https://gerrit.wikimedia.org/r/183514 [15:54:27] (03CR) 10Nikerabbit: [C: 031] "Now only waiting for service to become available?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181546 (owner: 10KartikMistry) [15:54:55] Glaisher, I can't think of why it would... [15:55:49] (03CR) 10BBlack: [C: 04-1] "Ultimately, while I think this is closer than the previous PS's, this still seems like a sketchy approach." [puppet] - 10https://gerrit.wikimedia.org/r/183171 (owner: 10Ori.livneh) [15:55:49] hm [15:56:03] https://phabricator.wikimedia.org/T61702 [15:56:04] (03PS7) 10BryanDavis: logstash: Parse apache syslog messages [puppet] - 10https://gerrit.wikimedia.org/r/179480 [15:56:31] (03PS8) 10BryanDavis: logstash: parse json encoded hhvm fatal errors [puppet] - 10https://gerrit.wikimedia.org/r/179759 [15:56:35] _joe_: Am I deploying a config change for you? [15:56:49] <_joe_> marktraceur: not now [15:56:53] bd808, legoktm, James_F, ping for SWAT [15:57:01] pong [15:57:08] <_joe_> I'm still working on it and it's a smallish one that I can sync afterwards [15:57:10] PROBLEM - Varnishkafka Delivery Errors per minute on cp3021 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [30.0] [15:57:55] marktraceur: I would suggest putting my config patch last on the off chance that it needs to be rolled back quickly [15:58:52] (03CR) 10MarkTraceur: [C: 032] monolog: honor log sampling and levels for logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181350 (https://phabricator.wikimedia.org/T85067) (owner: 10BryanDavis) [15:58:57] Oh. [15:59:05] heh. 
it will be fine :) [15:59:10] bd808: I was just thinking I'd do yours first because you're ready and it seems OK [15:59:19] And because config changes are easier [15:59:26] marktraceur: pong [15:59:35] Yay hi legoktm [16:00:04] manybubbles, anomie, ^d, marktraceur: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150108T1600). Please do the needful. [16:00:29] All right, lets break the sites with bd808 [16:01:05] Glaisher, loginwiki literally only has the main page, which is restricted editing to local sysops and a bunch of global groups [16:01:08] (03Merged) 10jenkins-bot: monolog: honor log sampling and levels for logstash [mediawiki-config] - 10https://gerrit.wikimedia.org/r/181350 (https://phabricator.wikimedia.org/T85067) (owner: 10BryanDavis) [16:02:33] !log marktraceur Synchronized wmf-config/logging.php: [SWAT] Honor log sampling and levels for logstash on group0 wikis (duration: 00m 05s) [16:02:36] Logged the message, Master [16:02:44] bd808: Looks like no crashing [16:02:49] bd808: Test? [16:02:56] Krenair: and a page in mw ns :P [16:02:58] sp-contributions-footer, I think [16:03:13] marktraceur: Still here. [16:03:19] well... hard to test directly. Let me hit some group0 pages though [16:03:27] James_F: Sweet [16:03:39] bd808: K, I'll move on to legoktm [16:03:41] RECOVERY - Varnishkafka Delivery Errors per minute on cp3021 is OK: OK: Less than 1.00% above the threshold [0.0] [16:03:58] Glaisher, yeah, which has the same editing restrictions as the main page [16:04:00] Ugh, legoktm, I have to make your submodule patches? [16:04:04] marktraceur: oh, I can do that [16:04:11] That would be super [16:04:16] I can do James_F in the meantimme [16:04:25] And then I can do his patches. [16:04:34] Quite. [16:05:12] Glaisher, actually can local sysops edit that page? I'm not sure they can [16:05:37] local sysops? I don't think there are any [16:05:56] only stews and editinterface can edit there, I think [16:06:41] Glaisher, yes, also staff and sysadmin [16:06:42] Jenkins is doing its thing [16:06:49] yeah [16:07:04] marktraceur: My patch is looking good. Things are still being logged and the frequency of GlobalTitleFail events is dropping (it is supposed to be sampled as wasn't previously). [16:07:15] bd808: Great! [16:07:26] I will consider it a success. [16:07:45] This was a triumph. [16:07:57] keep calm and log on [16:08:28] (03CR) 10Alex Monk: "Do you remember why this was done, Andrew?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/159850 (owner: 10Andrew Bogott) [16:08:55] (03PS1) 10Alexandros Kosiaris: apertium-eo-en instead of apertium-en-eo [puppet] - 10https://gerrit.wikimedia.org/r/183516 [16:09:26] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-eo-en instead of apertium-en-eo [puppet] - 10https://gerrit.wikimedia.org/r/183516 (owner: 10Alexandros Kosiaris) [16:09:49] Thanks legoktm [16:10:07] marktraceur: https://gerrit.wikimedia.org/r/183517 and https://gerrit.wikimedia.org/r/183518 [16:10:40] I'll merge wmf13 now, Jenkins can consider it while I deploy James_F's wmf14 patch. [16:12:10] * James_F nods. [16:12:17] And we're off! [16:12:32] 3operations, ContentTranslation-cxserver: Deploy apertium in production - https://phabricator.wikimedia.org/T86026#963120 (10akosiaris) 5Open>3Resolved All of the above have been merged and apertium is deployed in production. It answers on the apertium.svc.eqiad.wmnet hostname on port 2737 and monitoring is... 
[16:14:21] James_F: Going now, be ready to test! [16:14:29] !log marktraceur Synchronized php-1.25wmf14/extensions/VisualEditor/: [SWAT] VisualEditor fixes for T86046 and T86056 (duration: 00m 05s) [16:14:33] James_F: And test. [16:14:35] Logged the message, Master [16:14:39] marktraceur: Am waiting for RL cache. [16:14:50] OK! [16:15:01] marktraceur: And bingo, works great. Thanks! [16:15:22] Awesome [16:15:32] legoktm: Arright, just hit +2 on wmf14, waiting for jenkins [16:15:36] AS USUAL. [16:15:49] Jenkins shouldn't be a butler [16:15:54] He should be a security guard [16:17:46] wut [16:17:52] marktraceur: denied [16:19:22] PROBLEM - Varnishkafka Delivery Errors per minute on cp3010 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [30.0] [16:21:11] ...huh. [16:21:13] legoktm: Denied [16:21:22] PROBLEM - Varnishkafka Delivery Errors per minute on cp3022 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [30.0] [16:21:48] > maybe your computer is too fast? [16:21:53] Fucking Jenkins [16:22:19] Quick, everyone ddos gallium. [16:23:02] that test has been flaky lately [16:23:33] Yeah, I submitted a recheck [16:23:36] So we'll be here a while. [16:23:51] (03PS1) 10Ottomata: Increase critical threshold for varnishkafka drerr [puppet] - 10https://gerrit.wikimedia.org/r/183520 [16:23:51] wmf14 is go! [16:24:36] (03PS2) 10Ottomata: Increase critical threshold for varnishkafka drerr [puppet] - 10https://gerrit.wikimedia.org/r/183520 [16:25:06] !log marktraceur Synchronized php-1.25wmf14/extensions/AbuseFilter: [SWAT] [AbuseFilter] Add file_size variable (duration: 00m 06s) [16:25:08] legoktm: Testy test [16:25:09] Logged the message, Master [16:26:51] marktraceur: looks good, but this is going to need a scap :/ [16:27:11] Oh dear. [16:27:19] Well I'd better get the other patch ready to go first [16:27:24] * bd808 shakes fist at l10n cache [16:27:31] RECOVERY - Varnishkafka Delivery Errors per minute on cp3022 is OK: OK: Less than 1.00% above the threshold [0.0] [16:27:34] I'm sure I should be saying something like "scap? During SWAT? Shame." but whatever [16:27:47] (03CR) 10Ottomata: [C: 032] Increase critical threshold for varnishkafka drerr [puppet] - 10https://gerrit.wikimedia.org/r/183520 (owner: 10Ottomata) [16:27:53] marktraceur: or we can just hope no one looks at that part of the AbuseFilter for a few hours until Reedy scaps later today [16:28:01] RECOVERY - Varnishkafka Delivery Errors per minute on cp3010 is OK: OK: Less than 1.00% above the threshold [0.0] [16:28:05] legoktm: I'm a man of action! [16:28:19] Assuming, that is, that jenkins passes your patch. [16:28:39] legoktm: I'm not scapping today... [16:28:45] Also that. 
[16:28:50] 3operations, ops-core: install/deploy codfw appservers - https://phabricator.wikimedia.org/T85227#963162 (10RobH) once the rest of the ports are setup, i'll start installs on these (chatted with Joe about this in IRC just now) [16:28:52] PROBLEM - Varnishkafka Delivery Errors per minute on cp3006 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [30.0] [16:29:11] oh :| [16:29:29] bah, today is thursday not wednesday [16:29:54] * marktraceur pets legoktm [16:30:12] Wikidata are supposed to be scapping today [16:30:15] Not sure what time though [16:30:30] Earlier [16:32:13] PROBLEM - Varnishkafka Delivery Errors per minute on cp3004 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [30.0] [16:32:15] (03PS1) 10Yuvipanda: network: Add shinken test server to $monitoring_hosts ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/183521 (https://phabricator.wikimedia.org/T86143) [16:32:23] (03CR) 10jenkins-bot: [V: 04-1] network: Add shinken test server to $monitoring_hosts ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/183521 (https://phabricator.wikimedia.org/T86143) (owner: 10Yuvipanda) [16:32:40] (03PS2) 10Yuvipanda: network: Add shinken test server to $monitoring_hosts ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/183521 (https://phabricator.wikimedia.org/T86143) [16:33:02] Jenkins +2'd it but didn't submit [16:33:09] So annoying. [16:34:19] OK, legoktm, we're set to scap [16:34:55] (03CR) 10Yuvipanda: [C: 032] network: Add shinken test server to $monitoring_hosts ferm rule [puppet] - 10https://gerrit.wikimedia.org/r/183521 (https://phabricator.wikimedia.org/T86143) (owner: 10Yuvipanda) [16:34:59] !log marktraceur Started scap: [SWAT] [AbuseFilter] Add file_size variable [16:35:06] Logged the message, Master [16:35:06] Heeeeeere we go! [16:35:13] woot [16:36:11] RECOVERY - Varnishkafka Delivery Errors per minute on cp3006 is OK: OK: Less than 1.00% above the threshold [0.0] [16:36:31] It's not like we're on a schedule, there's no other deploy scheduled until the evening SWAT "tomorrow" [16:37:27] (03PS1) 10Giuseppe Lavagetto: memcached: use a unix socket instead of a tcp connection on selected hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183526 [16:38:22] RECOVERY - Varnishkafka Delivery Errors per minute on cp3004 is OK: OK: Less than 1.00% above the threshold [0.0] [16:38:40] (03PS1) 10coren: Labs: Make dynamic proxies use local resolver [puppet] - 10https://gerrit.wikimedia.org/r/183527 [16:39:49] (03CR) 10Giuseppe Lavagetto: [C: 031] "please do!" 
[puppet] - 10https://gerrit.wikimedia.org/r/183500 (owner: 10Faidon Liambotis) [16:40:10] YuviPanda: ^^ [16:40:18] _joe_: moar coming [16:40:34] PROBLEM - Varnishkafka Delivery Errors per minute on cp3003 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [30.0] [16:42:19] (03CR) 10Yuvipanda: [C: 04-1] Labs: Make dynamic proxies use local resolver (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/183527 (owner: 10coren) [16:43:42] (03PS3) 10Faidon Liambotis: ssh: configure ECDSA & ed25519 host keys [puppet] - 10https://gerrit.wikimedia.org/r/183500 [16:43:44] (03PS1) 10Faidon Liambotis: ssh: export ECDSA & ed25519 hostkeys [puppet] - 10https://gerrit.wikimedia.org/r/183528 [16:43:46] (03PS1) 10Faidon Liambotis: ssh: use sshkey's host_aliases instead of three keys [puppet] - 10https://gerrit.wikimedia.org/r/183529 [16:43:48] (03PS1) 10Faidon Liambotis: ssh: remove ssh::hostkey definition [puppet] - 10https://gerrit.wikimedia.org/r/183530 [16:43:50] (03PS1) 10Faidon Liambotis: ssh: purge unmanaged ssh_known_hosts keys [puppet] - 10https://gerrit.wikimedia.org/r/183531 [16:44:33] there [16:44:41] _joe_: ^ [16:46:11] (03CR) 10coren: "Responses inline." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/183527 (owner: 10coren) [16:46:14] RECOVERY - Varnishkafka Delivery Errors per minute on cp3003 is OK: OK: Less than 1.00% above the threshold [0.0] [16:46:53] <_joe_> oh [16:48:14] (03PS2) 10coren: Labs: Make dynamic proxies use local resolver [puppet] - 10https://gerrit.wikimedia.org/r/183527 [16:48:37] (03CR) 10Ottomata: "Puppetizing the file is the proper thing to do :) Make a maven puppet module?" [puppet] - 10https://gerrit.wikimedia.org/r/170668 (owner: 10QChris) [16:51:15] !log Restarted Zuul. Same issue as https://wikitech.wikimedia.org/wiki/Incident_documentation/20150106-Zuul [16:51:17] Ugh come on scap [16:51:20] Logged the message, Master [16:52:10] (03CR) 10Yuvipanda: Labs: Make dynamic proxies use local resolver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/183527 (owner: 10coren) [16:52:25] Coren: ^ [16:54:56] (03CR) 10Giuseppe Lavagetto: [C: 031] ssh: export ECDSA & ed25519 hostkeys [puppet] - 10https://gerrit.wikimedia.org/r/183528 (owner: 10Faidon Liambotis) [16:58:56] (03CR) 10Giuseppe Lavagetto: [C: 031] ssh: use sshkey's host_aliases instead of three keys [puppet] - 10https://gerrit.wikimedia.org/r/183529 (owner: 10Faidon Liambotis) [17:00:53] (03CR) 10Giuseppe Lavagetto: [C: 031] ssh: remove ssh::hostkey definition [puppet] - 10https://gerrit.wikimedia.org/r/183530 (owner: 10Faidon Liambotis) [17:01:39] (03CR) 10Yuvipanda: [C: 031] Labs: Make dynamic proxies use local resolver [puppet] - 10https://gerrit.wikimedia.org/r/183527 (owner: 10coren) [17:01:41] Coren: ^ [17:01:44] <_joe_> paravoid: re https://gerrit.wikimedia.org/r/#/c/183531/ - I was trying to think at cases where we don't want to purge unmanaged host keys [17:02:40] <_joe_> I don't see any [17:04:25] 3operations, Continuous-Integration: Acquire old production API servers for use in CI - https://phabricator.wikimedia.org/T84940#963335 (10hashar) We might have a use for them to set some CI supporting servers straight inside the labs infrastructure. That would host the Zuul mergers which provides the patch set... [17:05:46] marktraceur: its still going? [17:06:11] legoktm: Yup [17:06:14] (03CR) 10Giuseppe Lavagetto: [C: 031] ssh: purge unmanaged ssh_known_hosts keys [puppet] - 10https://gerrit.wikimedia.org/r/183531 (owner: 10Faidon Liambotis) [17:06:15] It's scap. 
It's slow. [17:08:26] !log marktraceur Finished scap: [SWAT] [AbuseFilter] Add file_size variable (duration: 33m 27s) [17:08:28] Logged the message, Master [17:08:29] Yay! [17:08:31] legoktm: Testy test [17:08:54] (03CR) 10coren: [C: 032] "Test away! Fire in the hole!" [puppet] - 10https://gerrit.wikimedia.org/r/183527 (owner: 10coren) [17:09:12] Ah, hm. Where is jenkings-bot when you need it? [17:09:23] Is it frozed? [17:09:42] Coren: https://integration.wikimedia.org/zuul/ says nothing in the queues [17:09:43] Coren: Zuul deadlocked, gotta 'recheck' the patch :d [17:09:48] Neat. [17:09:50] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/183527 (owner: 10coren) [17:10:08] Gerrit is flappy apparently [17:10:39] who is the recheck limited to, again? [17:10:48] "trusted" people, but I assume that means some LDAP group? [17:10:50] there is a bug in Zuul that doesn't properly handle Gerrit stalling an ssh connection :/ [17:10:51] 3operations, Project-Creators: HTTPS phabricator project(s) - https://phabricator.wikimedia.org/T86063#963357 (10Aklapper) I propose to 1) kill T29946 (it once was about secure.wikimedia.org and then its scope got incorrectly broadened) and 2) rename "Wikimedia-SSL-related" to "HTTPS". Plus for both cases, when... [17:11:05] greg-g: manually maintained list of emails in the zuul conf [17:11:13] heh, "awesome" [17:11:14] marktraceur: looks good! thanks :D [17:11:17] greg-g: though *@wikimedia.org accounts are whitelisted [17:11:18] Awesome [17:11:20] SWAT IS OVER [17:11:23] hashar: ahh, good [17:11:33] I am off [17:11:37] g'night! [17:11:39] if in doubt, restart Zuul on gallium [17:12:10] (03PS4) 10Filippo Giunchedi: graphite: introduce local c-relay [puppet] - 10https://gerrit.wikimedia.org/r/181080 (https://phabricator.wikimedia.org/T85908) [17:12:22] so, swat is over over? [17:12:26] can I break ssh across the fleet now? [17:12:31] <_joe_> eheh [17:12:37] <_joe_> I don't think you will [17:12:41] <_joe_> (last famous words) [17:13:20] (03PS2) 10Faidon Liambotis: ssh: use sshkey's host_aliases instead of three keys [puppet] - 10https://gerrit.wikimedia.org/r/183529 [17:13:22] (03PS2) 10Faidon Liambotis: ssh: export ECDSA & ed25519 hostkeys [puppet] - 10https://gerrit.wikimedia.org/r/183528 [17:13:24] (03PS2) 10Faidon Liambotis: ssh: purge unmanaged ssh_known_hosts keys [puppet] - 10https://gerrit.wikimedia.org/r/183531 [17:13:26] (03PS2) 10Faidon Liambotis: ssh: remove ssh::hostkey definition [puppet] - 10https://gerrit.wikimedia.org/r/183530 [17:13:28] paravoid: There is no doubt you can. I'm pretty sure you shouldn't though. 
:-) [17:13:28] (03PS4) 10Faidon Liambotis: ssh: configure ECDSA & ed25519 host keys [puppet] - 10https://gerrit.wikimedia.org/r/183500 [17:13:34] paravoid: yeah, you're good for a few hours [17:13:55] (03CR) 10Faidon Liambotis: [C: 032] ssh: configure ECDSA & ed25519 host keys [puppet] - 10https://gerrit.wikimedia.org/r/183500 (owner: 10Faidon Liambotis) [17:14:38] (03PS1) 10Filippo Giunchedi: graphite: archive received metrics on disk [puppet] - 10https://gerrit.wikimedia.org/r/183537 (https://phabricator.wikimedia.org/T85908) [17:15:06] (03PS2) 10Filippo Giunchedi: gdash: add carbon-cache utilization dashboard [puppet] - 10https://gerrit.wikimedia.org/r/183476 [17:15:08] (03CR) 10Faidon Liambotis: [C: 032] ssh: export ECDSA & ed25519 hostkeys [puppet] - 10https://gerrit.wikimedia.org/r/183528 (owner: 10Faidon Liambotis) [17:15:25] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] gdash: add carbon-cache utilization dashboard [puppet] - 10https://gerrit.wikimedia.org/r/183476 (owner: 10Filippo Giunchedi) [17:15:44] PROBLEM - Varnishkafka Delivery Errors per minute on cp3018 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [20000.0] [17:15:59] nooo [17:16:06] paravoid: sigh sorry I thought you already merged it, there'll be my harmless change [17:16:08] there goes my rebase :) [17:16:23] (03CR) 10Faidon Liambotis: [C: 032] ssh: use sshkey's host_aliases instead of three keys [puppet] - 10https://gerrit.wikimedia.org/r/183529 (owner: 10Faidon Liambotis) [17:16:42] * godog to the corner [17:16:49] (03CR) 10Faidon Liambotis: [C: 032] ssh: remove ssh::hostkey definition [puppet] - 10https://gerrit.wikimedia.org/r/183530 (owner: 10Faidon Liambotis) [17:17:18] Gerrit Code Review: Merge "ssh: remove ssh::hostkey definition" into production (cc4d76b) [17:17:21] Gerrit Code Review: Merge "ssh: use sshkey's host_aliases instead of three keys" into production (f13a028) [17:17:24] Gerrit Code Review: Merge "ssh: export ECDSA & ed25519 hostkeys" into production (4df3829) [17:17:27] Gerrit Code Review: Merge "gdash: add carbon-cache utilization dashboard" into production (7a0c489) [17:17:30] :( [17:17:33] sad commit history is sad [17:17:48] <_joe_> sigh [17:18:37] :( [17:19:23] grrrr stupid puppet [17:19:51] desc "The encryption type used. Probably ssh-dss or ssh-rsa." 
[17:19:54] newvalues :'ssh-dss', :'ssh-rsa', :'ecdsa-sha2-nistp256', :'ecdsa-sha2-nistp384', :'ecdsa-sha2-nistp521' [17:22:24] PROBLEM - puppet last run on bast1001 is CRITICAL: CRITICAL: puppet fail [17:22:41] <_joe_> oh man [17:22:45] <_joe_> I didn't check that [17:23:13] RECOVERY - Varnishkafka Delivery Errors per minute on cp3018 is OK: OK: Less than 1.00% above the threshold [0.0] [17:23:29] (03PS3) 10Faidon Liambotis: ssh: purge unmanaged ssh_known_hosts keys [puppet] - 10https://gerrit.wikimedia.org/r/183531 [17:23:31] (03PS1) 10Faidon Liambotis: ssh: s/ssh-ecdsa/ecdsa-sha2-nistp256/, puppet error [puppet] - 10https://gerrit.wikimedia.org/r/183539 [17:23:44] (03CR) 10Greg Grossmeier: Duplicate -qa notifcations to -releng (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/183382 (https://phabricator.wikimedia.org/T86053) (owner: 10Hashar) [17:23:50] (03CR) 10Faidon Liambotis: [C: 032] ssh: s/ssh-ecdsa/ecdsa-sha2-nistp256/, puppet error [puppet] - 10https://gerrit.wikimedia.org/r/183539 (owner: 10Faidon Liambotis) [17:23:54] (03CR) 10Faidon Liambotis: [V: 032] ssh: s/ssh-ecdsa/ecdsa-sha2-nistp256/, puppet error [puppet] - 10https://gerrit.wikimedia.org/r/183539 (owner: 10Faidon Liambotis) [17:25:35] now I have to wait 20 minutes for those resources to be re-exported [17:25:38] *sigh* [17:25:55] also, I think I'll have to remove ed25519... [17:26:04] I think those auto-merges in the history happen when you don't rebase the final commit on top of whatever's happening at the moment, but gerrit can auto-merge it anyways because no conflict [17:26:32] (03PS1) 10BryanDavis: Change fatalmonitor script to read hhvm.log [puppet] - 10https://gerrit.wikimedia.org/r/183541 [17:26:42] a jessie host would export an ssh-key resource with it, but precise/trusty won't be able to collect it [17:27:08] why on earth is puppet even caring about the type [17:27:10] stupid puppet [17:30:04] PROBLEM - puppet last run on tin is CRITICAL: CRITICAL: puppet fail [17:30:50] (03CR) 10Chad: "@Faidon: It shouldn't be hitting any of those things or it's doing its job wrong. All it should hit is the l10n cache for the "This page i" [puppet] - 10https://gerrit.wikimedia.org/r/182558 (owner: 10OliverKeyes) [17:33:02] 3operations, ops-core: Deploy hhvm 3.3.1+dfsg1-1+wm2 to the production cluster - https://phabricator.wikimedia.org/T85812#963407 (10Joe) a:3Joe [17:35:37] 3operations, ops-core: Deploy hhvm 3.3.1+dfsg1-1+wm2 to the production cluster - https://phabricator.wikimedia.org/T85812#963459 (10Joe) Deployed today to the canary hosts, if no problem arises it will be extended to all hosts on monday.
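For context on the sshkey exchange above: the newvalues list paravoid pastes is from puppet's stock sshkey type, which only accepts ssh-dss, ssh-rsa and the three NIST ECDSA identifiers, so an ssh-ed25519 hostkey exported by a jessie box cannot be collected on precise/trusty agents. A rough, simplified sketch of the export/collect pattern being touched here, not the actual ssh::hostkeys* classes from operations/puppet:

```puppet
# Illustration only. Each host exports its own hostkey; every host then
# collects the lot into /etc/ssh/ssh_known_hosts.
class ssh_hostkeys_export {
    # ssh-rsa is safe everywhere; the type value has to be one of the
    # strings the sshkey type whitelists, which is why ssh-ed25519 ends
    # up not exported (see "ssh: do not export ssh-ed25519 keys" below).
    @@sshkey { $::fqdn:
        ensure       => present,
        type         => 'ssh-rsa',
        key          => $::sshrsakey,
        host_aliases => [ $::hostname, $::ipaddress ],
    }
    # A second resource for the same host's ECDSA key runs into sshkey
    # treating the bare hostname as its unique namevar -- the "parser
    # barfs" problem paravoid hits later in the evening.
}

class ssh_hostkeys_collect {
    Sshkey <<| |>>
}
```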
[17:36:36] 3operations: Localisation update broken on 1.22wmf8 and 1.22wmf9 - https://phabricator.wikimedia.org/T52433#963468 (10greg) p:5Unbreak!>3Normal [17:40:29] (03PS1) 10Yuvipanda: labs: Make dnsmasq return non-zero TTL for lookups [puppet] - 10https://gerrit.wikimedia.org/r/183544 [17:40:32] Coren: ^ [17:40:33] PROBLEM - Varnishkafka Delivery Errors per minute on cp3008 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [20000.0] [17:41:14] PROBLEM - Varnishkafka Delivery Errors per minute on cp3017 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [20000.0] [17:42:09] (03PS2) 10Yuvipanda: labs: Make dnsmasq return non-zero TTL for lookups [puppet] - 10https://gerrit.wikimedia.org/r/183544 (https://phabricator.wikimedia.org/T72076) [17:42:26] (03PS3) 10coren: labs: Make dnsmasq return non-zero TTL for lookups [puppet] - 10https://gerrit.wikimedia.org/r/183544 (https://phabricator.wikimedia.org/T72076) (owner: 10Yuvipanda) [17:43:09] (03CR) 10coren: [C: 031] ""oesn't sound very good" indeed. It's effing inane is what it is." [puppet] - 10https://gerrit.wikimedia.org/r/183544 (https://phabricator.wikimedia.org/T72076) (owner: 10Yuvipanda) [17:43:14] PROBLEM - Varnishkafka Delivery Errors per minute on cp3004 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [20000.0] [17:43:30] Coren: hmm, I wonder if nxdomains will be cached? [17:43:45] hopefully not [17:44:00] (03CR) 10Yuvipanda: [C: 032] labs: Make dnsmasq return non-zero TTL for lookups [puppet] - 10https://gerrit.wikimedia.org/r/183544 (https://phabricator.wikimedia.org/T72076) (owner: 10Yuvipanda) [17:44:06] YuviPanda: Not as much of an issue either way; our traffic is borne out of repeated hits not misses. [17:44:12] Coren: righ. [17:44:19] But worse case scenario is 5min lag - I can live with that. [17:45:16] dnsmasq: self DOS ftw! [17:45:18] Coren: which nodes are these running ona gain? [17:45:22] *on again [17:45:34] labnet1001 iirc [17:45:43] lalabnet? [17:45:44] yeah [17:45:56] Coren: and I’ve to restart nova network? [17:46:14] PROBLEM - Varnishkafka Delivery Errors per minute on cp3018 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [20000.0] [17:46:20] Yes; if you try to restart dnsmasq it goes craycray [17:46:27] yeah [17:46:27] 3operations, Wikimedia-SSL-related: Disable SSL 3.0 on Wikimedia sites to mitigate POODLE attack (CVE-2014-3566) - https://phabricator.wikimedia.org/T74072#963586 (10Dzahn) see my comment above: ticket for upgrading sodium: RT #5420 ticket for deploying new dumps misc hosts RT #4570 ticket for enabling https on... [17:46:34] RECOVERY - Varnishkafka Delivery Errors per minute on cp3008 is OK: OK: Less than 1.00% above the threshold [0.0] [17:46:40] alright, running puppet there now [17:46:51] * Coren keeps an eye on the traffic [17:46:53] PROBLEM - Varnishkafka Delivery Errors per minute on cp3016 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [20000.0] [17:46:53] PROBLEM - Varnishkafka Delivery Errors per minute on cp3021 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [20000.0] [17:47:05] I'll see quickly enough if that sufficed. 
[17:47:23] RECOVERY - Varnishkafka Delivery Errors per minute on cp3017 is OK: OK: Less than 1.00% above the threshold [0.0] [17:48:52] Coren: I just did a reload [17:49:01] not too sure if that’s enough [17:49:03] Bi change (yet) [17:49:06] No* [17:49:10] yeah [17:49:14] PROBLEM - Varnishkafka Delivery Errors per minute on cp3020 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [20000.0] [17:49:21] Coren: will a restart disrupt anything? [17:49:33] (I should document this on https://wikitech.wikimedia.org/wiki/Labs_DNS when done) [17:49:33] It shouldn't. [17:49:44] The bridges remain up [17:49:53] (03CR) 10Aaron Schulz: [C: 031] memcached: use a unix socket instead of a tcp connection on selected hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183526 (owner: 10Giuseppe Lavagetto) [17:49:59] Coren: restarted [17:50:14] Doesn't seem to have made a change. Odd. [17:50:14] PROBLEM - Varnishkafka Delivery Errors per minute on cp3019 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [20000.0] [17:50:23] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 594.450012 [17:50:31] TTL is still 0 [17:50:34] PROBLEM - Varnishkafka Delivery Errors per minute on cp3022 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [20000.0] [17:50:45] yeah [17:51:06] Coren: restarting nova-network doesn’t actually seem to restart dnsmasq [17:51:10] has same pid before and after [17:51:23] Ah. So you want to down dnsmasq and restart nova-network to get a fresh one. [17:52:14] PROBLEM - Varnishkafka Delivery Errors per minute on cp3015 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [20000.0] [17:52:33] 3operations: Determine Trebuchet/git-deploy maintenance plan - https://phabricator.wikimedia.org/T85008#963632 (10greg) p:5Triage>3High [17:52:46] Hang on, there are two now. [17:52:54] RECOVERY - Varnishkafka Delivery Errors per minute on cp3004 is OK: OK: Less than 1.00% above the threshold [0.0] [17:52:58] (03PS1) 10Aaron Schulz: Use ProfilerSectionOnly to hande DB/filebackend entries and the like [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183545 [17:53:15] Coren: no, there have been two since the start [17:53:18] (03PS1) 10Faidon Liambotis: ssh: do not export ed25519 keys [puppet] - 10https://gerrit.wikimedia.org/r/183546 [17:53:24] Coren: and there’s no upstart file for dnsmasq, but there’s a init.d file [17:53:29] that’s just a wrapper for an upstart file [17:53:31] which doesn’t exist [17:53:32] wat [17:53:34] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [17:53:38] (03CR) 10jenkins-bot: [V: 04-1] ssh: do not export ed25519 keys [puppet] - 10https://gerrit.wikimedia.org/r/183546 (owner: 10Faidon Liambotis) [17:53:39] YuviPanda: That was me. [17:54:08] hmm [17:54:09] huh? [17:54:09] <^d> godog: Any chance you could turn your +1 into a +2 on https://gerrit.wikimedia.org/r/#/c/180210/? [17:54:13] Coren: why? [17:54:23] (03PS2) 10Faidon Liambotis: ssh: do not export ed25519 keys [puppet] - 10https://gerrit.wikimedia.org/r/183546 [17:54:25] (03PS4) 10Faidon Liambotis: ssh: purge unmanaged ssh_known_hosts keys [puppet] - 10https://gerrit.wikimedia.org/r/183531 [17:54:35] YuviPanda: No wait, /that/ wasn't me. 
[17:55:14] 3operations, Wikimedia-SSL-related: Disable SSL 3.0 on Wikimedia sites to mitigate POODLE attack (CVE-2014-3566) - https://phabricator.wikimedia.org/T74072#963648 (10chasemp) p:5High>3Normal >>! In T74072#963586, @Dzahn wrote: > see my comment above: > > ticket for upgrading sodium: RT #5420 > ticket for de... [17:55:14] RECOVERY - Varnishkafka Delivery Errors per minute on cp3016 is OK: OK: Less than 1.00% above the threshold [0.0] [17:55:16] (03PS5) 10Faidon Liambotis: ssh: purge unmanaged ssh_known_hosts keys [puppet] - 10https://gerrit.wikimedia.org/r/183531 [17:55:17] Ah, everything is back to normal now. [17:55:18] (03PS3) 10Faidon Liambotis: ssh: do not export ssh-ed25519 keys [puppet] - 10https://gerrit.wikimedia.org/r/183546 [17:55:21] YuviPanda: Aaand... BAM! DNS traffic just dropped 90% [17:55:37] Coren: did you kill it by hand and restart nova-network? [17:55:44] (03CR) 10Faidon Liambotis: [C: 032] ssh: do not export ssh-ed25519 keys [puppet] - 10https://gerrit.wikimedia.org/r/183546 (owner: 10Faidon Liambotis) [17:55:49] and yeah, 300s TTL [17:56:09] YuviPanda: Yes, that's what I did in the past when dnsmasq got wedged. I think the previous attempts confused it. [17:56:18] That's how I just fix't it [17:56:21] 3operations: Determine Trebuchet/git-deploy maintenance plan - https://phabricator.wikimedia.org/T85008#936522 (10greg) Added #operations and some other people so we can figure out what to do here. The Trebuchet/git-deploy use case is one that was primarily maintained by ops (aka Ryan) previously and thus their... [17:56:23] Coren: ah, ok. [17:56:31] Coren: but why is there no upstart script? [17:56:45] (03CR) 10Faidon Liambotis: [V: 032] ssh: do not export ssh-ed25519 keys [puppet] - 10https://gerrit.wikimedia.org/r/183546 (owner: 10Faidon Liambotis) [17:56:55] Because dnsmasq is started "manually" by nova-network with its own params and commandline stuff [17:57:04] RECOVERY - Varnishkafka Delivery Errors per minute on cp3018 is OK: OK: Less than 1.00% above the threshold [0.0] [17:57:07] right, but why’s there an init.d wrapper?! [17:57:27] 3operations: LocalisationUpdate broken since 2014-12-16 - https://phabricator.wikimedia.org/T85790#963669 (10greg) >>! In T85790#958288, @Reedy wrote: > Well, the l10nupdate user can't write to the scap lock file as it's not in the wikidev user group.... > > ``` > l10nupdate@tin:/var/log/l10nupdatelog$ groups >... [17:57:40] Because dnsmasq "can" work on its own; it's installed by openstack. But the system config is not compatible with the way nova-network uses it [17:57:53] !swat add https://gerrit.wikimedia.org/r/#/c/180451/ [17:57:59] Any attempt to use 'service dnsmasq *' just breaks dnsmasq for nova entirely. [17:58:09] Coren: even ‘stop’? [17:58:33] It broke something everytime I tried. I think the runfile isn't even in the same place. [17:58:35] 3operations: LocalisationUpdate broken since 2014-12-16 - https://phabricator.wikimedia.org/T85790#954612 (10greg) see also: T76061 [17:58:41] pidfile* [17:58:45] 3operations: LocalisationUpdate broken since 2014-12-16 - https://phabricator.wikimedia.org/T85790#963688 (10Reedy) I think @bd808 suggested "we" just make scap delete the lock file at the end of an operation... [17:59:03] YuviPanda: Actual resolution load on dnsmasq is less that 25% of what it was. [17:59:17] Stupid stupid stupid stupid. TTL 0? Srsly? 
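The TTL-0 behaviour being cursed here is dnsmasq's default for locally generated answers (hosts-file and DHCP-lease records): with a zero TTL, every client re-resolves every name on every use, which is where the resolver load was coming from. A minimal sketch of the sort of fix 183544 applies, assuming the labs dnsmasq instance reads a puppet-managed config fragment; the file location and the 300-second value are illustrative only:

```puppet
# Sketch, not the actual labs manifest. dnsmasq's local-ttl defaults to 0
# for names it answers authoritatively (hosts file / DHCP leases), so
# clients cache nothing; a 5-minute TTL lets instances cache lookups.
file { '/etc/dnsmasq.d/labs-local-ttl.conf':
    ensure  => present,
    owner   => 'root',
    group   => 'root',
    mode    => '0444',
    content => "local-ttl=300\n",
}
# Note: this dnsmasq is spawned by nova-network, so "service dnsmasq
# restart" is not safe here; kill the nova-spawned dnsmasq and restart
# nova-network to pick the new option up, as discussed above.
```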
[17:59:30] Coren: https://wikitech.wikimedia.org/w/index.php?title=Labs_DNS&diff=140226&oldid=139786 [18:00:00] (03CR) 10Faidon Liambotis: [C: 032] ssh: purge unmanaged ssh_known_hosts keys [puppet] - 10https://gerrit.wikimedia.org/r/183531 (owner: 10Faidon Liambotis) [18:00:45] RECOVERY - Varnishkafka Delivery Errors per minute on cp3015 is OK: OK: Less than 1.00% above the threshold [0.0] [18:02:24] RECOVERY - puppet last run on bast1001 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [18:02:44] RECOVERY - Varnishkafka Delivery Errors per minute on cp3022 is OK: OK: Less than 1.00% above the threshold [0.0] [18:03:14] PROBLEM - Varnishkafka Delivery Errors on cp3021 is CRITICAL: kafka.varnishkafka.kafka_drerr.per_second CRITICAL: 458.783325 [18:03:26] huh [18:03:34] 3operations, Release-Engineering: performance testing environment - https://phabricator.wikimedia.org/T67394#963721 (10greg) [18:03:43] RECOVERY - Varnishkafka Delivery Errors per minute on cp3019 is OK: OK: Less than 1.00% above the threshold [0.0] [18:03:46] 12.04 db boxes don't have ecdsa keys [18:04:33] (03CR) 10MaxSem: [C: 04-1] Change fatalmonitor script to read hhvm.log (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/183541 (owner: 10BryanDavis) [18:04:34] Notice: /Stage[main]/Ssh::Hostkeys-collect/Sshkey[srv193.pmtpa.wmnet]/ensure: removed [18:04:40] Notice: /Stage[main]/Ssh::Hostkeys-collect/Sshkey[sq80.wikimedia.org]/ensure: removed [18:04:46] Notice: /Stage[main]/Ssh::Hostkeys-collect/Sshkey[knsq21.esams.wikimedia.org]/ensure: removed [18:04:49] omfg [18:04:53] heh [18:05:08] old old keys? [18:05:48] yeah :) [18:06:09] $ wc -l wmf_known_hosts [18:06:09] 4200 wmf_known_hosts [18:06:22] this is what my computer has [18:06:23] RECOVERY - Varnishkafka Delivery Errors per minute on cp3020 is OK: OK: Less than 1.00% above the threshold [0.0] [18:06:24] RECOVERY - Varnishkafka Delivery Errors on cp3021 is OK: kafka.varnishkafka.kafka_drerr.per_second OKAY: 0.0 [18:06:27] (which I frequently copy from bast1001) [18:06:29] (03PS1) 10BBlack: cp1008 -> jessie installer [puppet] - 10https://gerrit.wikimedia.org/r/183550 [18:06:31] root@bast1001:~# wc -l /etc/ssh/ssh_known_hosts [18:06:32] 819 /etc/ssh/ssh_known_hosts [18:06:58] this is also host_aliases, so it's 4200/3 vs. 819 [18:07:12] still, 1400 vs. 819 isn't so bad :) [18:07:34] 773 [18:07:34] root@bast1001:~# grep -c rsa /etc/ssh/ssh_known_hosts [18:07:34] 43 [18:08:00] it's a shame we can't go to ed25519 where supported [18:08:44] (03CR) 10BBlack: [C: 032] cp1008 -> jessie installer [puppet] - 10https://gerrit.wikimedia.org/r/183550 (owner: 10BBlack) [18:08:45] 3operations: LocalisationUpdate broken since 2014-12-16 - https://phabricator.wikimedia.org/T85790#963747 (10chasemp) >>! In T85790#963669, @greg wrote: >>>! In T85790#958288, @Reedy wrote: >> Well, the l10nupdate user can't write to the scap lock file as it's not in the wikidev user group.... >> >> ``` >> l10n... 
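The purge change merged above is what produces the stream of "Sshkey[...]/ensure: removed" notices for long-dead hosts like srv193.pmtpa.wmnet and the old knsq/sq squids. Its core is puppet's resources metatype; a plausible minimal form (the real patch in 183531 may scope or comment it differently):

```puppet
# Any entry in /etc/ssh/ssh_known_hosts that does not correspond to a
# managed (collected) sshkey resource gets removed on the next agent run,
# shrinking the fleet-wide known_hosts down to hosts that still exist.
resources { 'sshkey':
    purge => true,
}
```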
[18:08:56] (03CR) 10BBlack: [V: 032] cp1008 -> jessie installer [puppet] - 10https://gerrit.wikimedia.org/r/183550 (owner: 10BBlack) [18:09:13] jenkins :p [18:10:22] 3operations, Code-Review, Wikimedia-SSL-related, Wikimedia-Git-or-Gerrit: Chrome warns about insecure certificate on gerrit.wikimedia.org - https://phabricator.wikimedia.org/T76562#963749 (10chasemp) It is relevant for #code-review either way, so we definitely should then [18:10:53] PROBLEM - Varnishkafka Delivery Errors per minute on cp3019 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [20000.0] [18:11:40] 3operations: LocalisationUpdate broken since 2014-12-16 - https://phabricator.wikimedia.org/T85790#963751 (10greg) >>! In T85790#963688, @Reedy wrote: > I think @bd808 suggested "we" just make scap delete the lock file at the end of an operation... This seems like the easier solution. [18:11:50] I wasn't suggesting l10update should be added to wikidev necessarily [18:12:02] Reedy: my bad [18:12:24] PROBLEM - Varnishkafka Delivery Errors per minute on cp3020 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [20000.0] [18:12:34] RECOVERY - puppet last run on tin is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [18:12:37] Reedy: greg-g: can this be added to swat? https://gerrit.wikimedia.org/r/#/c/180451/ [18:12:39] It would solve it. But unknown possible side effects too [18:13:05] Reedy: can kinda test it in beta? [18:13:39] (03Abandoned) 10Aaron Schulz: Simplified profiler config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/178591 (owner: 10Aaron Schulz) [18:17:23] RECOVERY - Varnishkafka Delivery Errors per minute on cp3021 is OK: OK: Less than 1.00% above the threshold [0.0] [18:18:14] RECOVERY - Varnishkafka Delivery Errors per minute on cp3019 is OK: OK: Less than 1.00% above the threshold [0.0] [18:18:21] 3operations, Project-Creators: HTTPS phabricator project(s) - https://phabricator.wikimedia.org/T86063#963779 (10faidon) >>! In T86063#963357, @Aklapper wrote: > I propose to 1) kill T29946 (it once was about secure.wikimedia.org and then its scope got incorrectly broadened) and 2) rename "Wikimedia-SSL-related"... [18:21:04] RECOVERY - Varnishkafka Delivery Errors per minute on cp3020 is OK: OK: Less than 1.00% above the threshold [0.0] [18:23:58] can someone unstick jenkins? [18:40:07] 3Project-Creators, operations: HTTPS phabricator project(s) - https://phabricator.wikimedia.org/T86063#963816 (10chasemp) Andre, thanks for going over stuff thoroughly. I looked through an indeed a HTTPS tag seems like it will cover this and invalidates the need for T29946 So...I'm going for it. [18:41:32] (03CR) 10Ori.livneh: [C: 031] memcached: use a unix socket instead of a tcp connection on selected hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183526 (owner: 10Giuseppe Lavagetto) [18:43:21] (03CR) 10BryanDavis: Change fatalmonitor script to read hhvm.log (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/183541 (owner: 10BryanDavis) [18:44:38] 3Project-Creators, operations: HTTPS phabricator project(s) - https://phabricator.wikimedia.org/T86063#963837 (10chasemp) 5Open>3Resolved a:3chasemp OK! [18:46:03] 3Wikimedia-Git-or-Gerrit, Code-Review, operations: Chrome warns about insecure certificate on gerrit.wikimedia.org - https://phabricator.wikimedia.org/T76562#963842 (10RobH) I'd suggest we test and use the misc-web-lb for this, since it means one less certificate and therefore less general overhead (since we alr... 
[18:46:10] PROBLEM - Varnishkafka Delivery Errors per minute on cp3021 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [20000.0] [18:47:03] i wonder how long we're all going to play the 'someone should put this behind misc web lb but i dont wanna volunteer' game for gerrit task there... [18:47:15] cuz it can totally break gerrit. [18:47:43] paravoid: pong? [18:47:56] in theory we can set it all up and test wit local host file right? robh [18:47:57] cscott: i think he just went to dinner [18:48:13] well maybe not, nevermind [18:48:29] more than 30 seconds of thought on it changed my mind I think [18:48:35] you can add it to varnish config but not switch DNS , yea [18:48:40] and then test by hacking /etc/hosts [18:48:41] yea if it wasnt gerrit and its odd shit maybe, but dunno yea [18:48:52] the hosts hack should work [18:49:10] (I think yes but haven't thought it all the way through :) [18:49:21] as long as you dont need 2 backends on the same box [18:50:30] PROBLEM - Varnishkafka Delivery Errors per minute on cp3018 is CRITICAL: CRITICAL: 25.00% of data above the critical threshold [20000.0] [18:53:33] robh, paravoid: the danger of contentless pings [18:54:39] RECOVERY - Varnishkafka Delivery Errors per minute on cp3021 is OK: OK: Less than 1.00% above the threshold [0.0] [18:55:14] (03PS1) 10BBlack: use $::memorysize_mb in cache.pp [puppet] - 10https://gerrit.wikimedia.org/r/183556 [18:55:16] (03PS1) 10BBlack: switch min_free_kbytes to $::memorysize_mb [puppet] - 10https://gerrit.wikimedia.org/r/183557 [18:55:18] (03PS1) 10BBlack: remove custom memorysizeinbytes fact (no longer used) [puppet] - 10https://gerrit.wikimedia.org/r/183558 [18:58:45] !log Attempting manual run of l10nupdate [18:58:50] Logged the message, Master [19:01:30] RECOVERY - Varnishkafka Delivery Errors per minute on cp3018 is OK: OK: Less than 1.00% above the threshold [0.0] [19:05:01] cscott: T76115 [19:06:38] paravoid: ah, right. [19:13:34] !log reedy Synchronized php-1.25wmf13/cache/l10n: (no message) (duration: 00m 01s) [19:13:37] Logged the message, Master [19:13:55] that seems to be a lie [19:14:06] oh, fuck [19:14:26] bd808: ^^ a load of permissions denied public key stuff [19:14:37] !log LocalisationUpdate completed (1.25wmf13) at 2015-01-08 19:14:37+00:00 [19:14:41] Logged the message, Master [19:14:45] yup. that was what I was afraid of [19:15:11] This is nc from the netcat-openbsd package. An alternative nc is available [19:15:11] in the netcat-traditional package. [19:15:11] usage: nc [-46DdhklnrStUuvzC] [-i interval] [-P proxy_username] [-p source_port] [19:15:11] [-s source_ip_address] [-T ToS] [-w timeout] [-X proxy_protocol] [19:15:11] [-x proxy_address[:port]] [hostname] [port[s]] [19:15:19] 3operations, OCG-General-or-Unknown: OCG Queue Length Checks are unclear - https://phabricator.wikimedia.org/T76115#963932 (10cscott) Here's the 21 day history of the status queue length: {F27004} It varies between ~600k and ~720k in normal operation. I believe #ops has some temporary server reshuffling which... [19:15:40] paravoid: ^ [19:15:43] Reedy: I filed some other bug for the nc stuff [19:15:51] I thought I'd seen it before [19:15:56] Reedy: The pub key problem is still this -- https://gerrit.wikimedia.org/r/#/c/176750/3 [19:16:27] you mean, because we reverted it? 
[19:16:36] (because it didn't work) [19:17:27] https://phabricator.wikimedia.org/T1387 for nc [19:19:03] Reedy: Yeah the root problem is still there (l10nupdate can't read from the shared ssh-agent due to ownership permissions) [19:20:19] (03CR) 10BBlack: [C: 032 V: 032] use $::memorysize_mb in cache.pp [puppet] - 10https://gerrit.wikimedia.org/r/183556 (owner: 10BBlack) [19:21:12] 3operations, Release-Engineering: /usr/local/bin/deploy2graphite broken on tin due to nc command syntax - https://phabricator.wikimedia.org/T1387#963965 (10Reedy) ``` reedy@ubuntu64-web-esxi:~/git/operations/puppet$ grep -R MW_STATSD_HOST * files/misc/scripts/deploy2graphite: echo "deploy.${1}:1|c" |... [19:22:01] (03PS1) 10Reedy: Make deploy2graphite use mw-deployment-vars.sh [puppet] - 10https://gerrit.wikimedia.org/r/183568 [19:22:40] That seems too easy [19:24:18] Why did wikibugs not link to the change? [19:24:22] *task [19:24:33] Reedy: which one? [19:24:44] [19:21:12] operations, Release-Engineering: /usr/local/bin/deploy2graphite broken on tin due to nc command syntax - https://phabricator.wikimedia.org/T1387#963965 (Reedy) ``` reedy@ubuntu64-web-esxi:~/git/operations/puppet$ grep -R MW_STATSD_HOST * files/misc/scripts/deploy2graphite: echo "deploy.${1}:1|c" |... [19:24:52] oh, it did [19:24:55] I'm blind apparently [19:25:02] probably with the highlight [19:25:09] vmware? [19:25:25] heh [19:25:27] yeah [19:25:45] (03PS2) 10Reedy: Make deploy2graphite use mw-deployment-vars.sh [puppet] - 10https://gerrit.wikimedia.org/r/183568 (https://phabricator.wikimedia.org/T1387) [19:25:46] !log reedy Synchronized php-1.25wmf14/cache/l10n: (no message) (duration: 00m 01s) [19:25:47] TT? fail [19:25:52] Logged the message, Master [19:25:56] 3operations, Release-Engineering: /usr/local/bin/deploy2graphite broken on tin due to nc command syntax - https://phabricator.wikimedia.org/T1387#963987 (10Reedy) https://gerrit.wikimedia.org/r/#/c/183568/ [19:26:46] !log LocalisationUpdate completed (1.25wmf14) at 2015-01-08 19:26:46+00:00 [19:26:49] Logged the message, Master [19:27:22] (03CR) 10BBlack: [C: 032 V: 032] switch min_free_kbytes to $::memorysize_mb [puppet] - 10https://gerrit.wikimedia.org/r/183557 (owner: 10BBlack) [19:27:43] bblack: [19:27:48] and kill the fact? 
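On the memorysize patches just merged: the point is to derive values from facter's stock $::memorysize_mb fact rather than carrying a custom memorysizeinbytes fact, which also sidesteps the fact-sync worries discussed next. A rough sketch of the min_free_kbytes half of the idea; the 1.5% factor and the sysctl.d plumbing are assumptions for illustration, not what cache.pp actually does:

```puppet
# Illustration only: size vm.min_free_kbytes from the standard
# $::memorysize_mb fact (ERB exposes facts as instance variables).
$min_free_kbytes = inline_template('<%= (@memorysize_mb.to_f * 1024 * 0.015).round %>')

file { '/etc/sysctl.d/70-vm-min-free-kbytes.conf':
    ensure  => present,
    content => "vm.min_free_kbytes = ${min_free_kbytes}\n",
    notify  => Exec['apply-min-free-kbytes'],
}

exec { 'apply-min-free-kbytes':
    command     => "/sbin/sysctl -w vm.min_free_kbytes=${min_free_kbytes}",
    refreshonly => true,
}
```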
[19:28:12] it's coming, but I figured I'd separate them by a puppet cycle, in case of weird puppet dep issues between facts -> manifests [19:28:23] nah, facts are always synced before manifests [19:28:45] * bblack has a natural fear of puppet insanity [19:29:02] plugins are synced -> facter runs -> agent sends output to master -> master uses this as input for catalog generation and sents it back to agent [19:29:14] that used to not be the case before our most recent puppet upgrades [19:29:25] I broke puppet several times due to fact-sync issues, esp on new nodes [19:29:44] well you have to pass pluginsync for this to happen [19:29:48] :) [19:29:50] which we set in puppet.conf with puppet [19:30:07] but if you run the initial run with --pluginsync it should work regardless [19:30:10] yeah that was the workaround, for a while I had to remember to explicitly --pluginsync [19:30:23] but it's so broken that that isn't default behavior [19:30:25] :) [19:31:25] (03CR) 10BBlack: [C: 032 V: 032] remove custom memorysizeinbytes fact (no longer used) [puppet] - 10https://gerrit.wikimedia.org/r/183558 (owner: 10BBlack) [19:31:27] (03PS1) 10Faidon Liambotis: ocg: bump alert thresholds [puppet] - 10https://gerrit.wikimedia.org/r/183610 (https://phabricator.wikimedia.org/T76115) [19:31:35] (03PS2) 10Faidon Liambotis: ocg: bump alert thresholds [puppet] - 10https://gerrit.wikimedia.org/r/183610 (https://phabricator.wikimedia.org/T76115) [19:31:39] PROBLEM - Varnishkafka Delivery Errors per minute on cp3017 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [20000.0] [19:31:41] (03CR) 10Faidon Liambotis: [C: 032 V: 032] ocg: bump alert thresholds [puppet] - 10https://gerrit.wikimedia.org/r/183610 (https://phabricator.wikimedia.org/T76115) (owner: 10Faidon Liambotis) [19:31:46] (03PS5) 10QChris: Install maven on analytics clients [puppet] - 10https://gerrit.wikimedia.org/r/170668 [19:32:10] gerrit/jenkins is crawling now [19:34:16] paravoid: I merged yours [19:34:26] fatal: Unable to create '/var/lib/git/operations/puppet/.git/refs/remotes/origin/production.lock': File exists. [19:34:29] puppet-merge took so long to run that yours landed by then [19:34:29] heh :) [19:36:12] nice, puppet runs on cp1008 now as well as can be expected, that was the only real hangup [19:36:31] failing on all the missing custom packages for vk, etc of course [19:37:32] (03PS6) 10Ori.livneh: varnish: Route requests with 'X-Wikimedia-Debug=1' to test_wikipedia backend [puppet] - 10https://gerrit.wikimedia.org/r/183171 [19:37:57] (03CR) 10QChris: "> Make a maven puppet module?" [puppet] - 10https://gerrit.wikimedia.org/r/170668 (owner: 10QChris) [19:38:07] !log reedy Synchronized php-1.25wmf13/cache/l10n/: l10nupdate (duration: 03m 54s) [19:38:10] Logged the message, Master [19:38:16] (03CR) 10Ori.livneh: "@bblack: I think updating the hash with X-Wikimedia-Debug in vcl_hash (if and only if the header is present) takes care of that." 
[puppet] - 10https://gerrit.wikimedia.org/r/183171 (owner: 10Ori.livneh) [19:38:59] RECOVERY - OCG health on ocg1002 is OK: OK: ocg_job_status 694478 msg: ocg_render_job_queue 0 msg [19:39:19] PROBLEM - Varnishkafka Delivery Errors per minute on cp3018 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [20000.0] [19:39:26] !log reedy Synchronized php-1.25wmf13/cache/l10n/: (no message) (duration: 00m 05s) [19:39:29] Logged the message, Master [19:40:52] PROBLEM - Varnishkafka Delivery Errors per minute on cp3021 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [20000.0] [19:41:12] RECOVERY - Varnishkafka Delivery Errors per minute on cp3017 is OK: OK: Less than 1.00% above the threshold [0.0] [19:41:32] RECOVERY - OCG health on ocg1001 is OK: OK: ocg_job_status 694880 msg: ocg_render_job_queue 0 msg [19:42:22] PROBLEM - Varnishkafka Delivery Errors per minute on cp3019 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [20000.0] [19:43:26] !log reedy Synchronized php-1.25wmf14/cache/l10n/: l10nupdate (duration: 03m 39s) [19:43:30] Logged the message, Master [19:44:28] !log running dsh -g mediawiki-installation -M -F 40 -- "sudo -u mwdeploy /srv/deployment/scap/scap/bin/scap-rebuild-cdbs" [19:44:31] Logged the message, Master [19:45:42] RECOVERY - Varnishkafka Delivery Errors per minute on cp3018 is OK: OK: Less than 1.00% above the threshold [0.0] [19:47:57] !log scap-rebuild-cdbs finished [19:48:01] Logged the message, Master [19:48:19] stupid puppet is stupid [19:49:05] Duplicate definition: stupid is already defined; cannot redefine. [19:49:11] RECOVERY - Varnishkafka Delivery Errors per minute on cp3021 is OK: OK: Less than 1.00% above the threshold [0.0] [19:50:51] RECOVERY - Varnishkafka Delivery Errors per minute on cp3019 is OK: OK: Less than 1.00% above the threshold [0.0] [19:52:06] bd808|LUNCH: this isn't good either [19:52:07] reedy@tin:/srv/deployment/scap/scap$ git log [19:52:07] fatal: Not a git repository (or any of the parent directories): .git [19:52:07] reedy@tin:/srv/deployment/scap/scap$ [19:52:21] RECOVERY - OCG health on ocg1003 is OK: OK: ocg_job_status 696553 msg: ocg_render_job_queue 0 msg [19:54:05] ori: hah, you were actually very close [19:54:05] OF [19:54:33] so, sshkey uses namevar for ssh's hostname key [19:54:48] but in ssh, the hostname isn't unique, the combination of (hostname, type) is [19:55:09] but puppet's namevar is unique, the parser barfs [19:55:17] so there's no way to define two different key types [19:55:55] 3operations, OCG-General-or-Unknown: OCG Queue Length Checks are unclear - https://phabricator.wikimedia.org/T76115#964063 (10faidon) 5Open>3Resolved [19:55:57] wtf puppet, wtf [19:59:31] PROBLEM - Varnishkafka Delivery Errors per minute on cp3019 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [20000.0] [19:59:42] PROBLEM - puppet last run on db1027 is CRITICAL: CRITICAL: puppet fail [20:00:52] RECOVERY - puppet last run on db1027 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [20:09:02] RECOVERY - Varnishkafka Delivery Errors per minute on cp3019 is OK: OK: Less than 1.00% above the threshold [0.0] [20:20:27] (03PS1) 10Faidon Liambotis: ssh: add $::ipaddress6 to sshkey's aliases [puppet] - 10https://gerrit.wikimedia.org/r/183621 [20:20:29] (03PS1) 10Faidon Liambotis: ssh: clarify comments surrounding ed25519 keys [puppet] - 10https://gerrit.wikimedia.org/r/183622 [20:20:52] (03CR) 10Faidon Liambotis: [C: 032] ssh: add $::ipaddress6 to 
sshkey's aliases [puppet] - 10https://gerrit.wikimedia.org/r/183621 (owner: 10Faidon Liambotis) [20:21:04] three puppet bugs in one piece of code [20:21:05] amazing [20:21:18] it's like puppet 2.6 again! [20:22:10] (03CR) 10Faidon Liambotis: [C: 032] ssh: clarify comments surrounding ed25519 keys [puppet] - 10https://gerrit.wikimedia.org/r/183622 (owner: 10Faidon Liambotis) [20:22:53] !log moved /srv/deployment/scap to scap.old as git repo seems busted. Hoping puppet puts it back again correctly... [20:22:57] Logged the message, Master [20:23:46] the good news are that when we get rid of precise AND we upgrade puppet on trusty to something newer (e.g. 3.7) these will be gone/worked around [20:23:57] considering we still use lucid in three boxes... [20:25:01] $ egrep -v '(^amssq|^cp|^search|^tmh|^tin|^fluorine)' by-dist |grep -c precise [20:25:04] 215 [20:25:26] oh, ^mw too I guess [20:25:28] 199 [20:25:42] 3operations, Engineering-Community, WMF-Legal: Implement the Volunteer NDA process in Phabricator - https://phabricator.wikimedia.org/T655#964152 (10chasemp) So we have pretty much phased out RT and the existing process in the wiki mandated queues that are not in use anymore. I updated (unilaterial) the page he... [20:25:52] PROBLEM - puppet last run on mw1200 is CRITICAL: CRITICAL: Puppet has 1 failures [20:25:52] PROBLEM - puppet last run on mw1082 is CRITICAL: CRITICAL: Puppet has 1 failures [20:25:52] PROBLEM - puppet last run on mw1063 is CRITICAL: CRITICAL: Puppet has 1 failures [20:25:52] PROBLEM - puppet last run on mw1106 is CRITICAL: CRITICAL: Puppet has 1 failures [20:26:11] PROBLEM - puppet last run on mw1041 is CRITICAL: CRITICAL: Puppet has 1 failures [20:26:11] PROBLEM - puppet last run on mw1224 is CRITICAL: CRITICAL: Puppet has 1 failures [20:26:11] PROBLEM - puppet last run on mw1160 is CRITICAL: CRITICAL: Puppet has 1 failures [20:26:12] PROBLEM - puppet last run on mw1026 is CRITICAL: CRITICAL: Puppet has 1 failures [20:26:22] PROBLEM - puppet last run on mw1187 is CRITICAL: CRITICAL: Puppet has 1 failures [20:26:32] PROBLEM - puppet last run on mw1250 is CRITICAL: CRITICAL: Puppet has 1 failures [20:26:32] PROBLEM - puppet last run on mw1140 is CRITICAL: CRITICAL: Puppet has 1 failures [20:26:32] PROBLEM - puppet last run on mw1254 is CRITICAL: CRITICAL: Puppet has 1 failures [20:26:51] PROBLEM - puppet last run on mw1174 is CRITICAL: CRITICAL: Puppet has 1 failures [20:26:52] PROBLEM - puppet last run on mw1197 is CRITICAL: CRITICAL: Puppet has 1 failures [20:26:52] PROBLEM - puppet last run on mw1141 is CRITICAL: CRITICAL: Puppet has 1 failures [20:26:52] PROBLEM - puppet last run on snapshot1003 is CRITICAL: CRITICAL: Puppet has 1 failures [20:26:52] PROBLEM - puppet last run on mw1145 is CRITICAL: CRITICAL: Puppet has 1 failures [20:26:52] PROBLEM - puppet last run on mw1226 is CRITICAL: CRITICAL: Puppet has 1 failures [20:27:02] PROBLEM - puppet last run on mw1242 is CRITICAL: CRITICAL: Puppet has 1 failures [20:27:03] PROBLEM - puppet last run on mw1088 is CRITICAL: CRITICAL: Puppet has 1 failures [20:27:22] PROBLEM - puppet last run on mw1099 is CRITICAL: CRITICAL: Puppet has 1 failures [20:27:32] PROBLEM - puppet last run on mw1164 is CRITICAL: CRITICAL: Puppet has 1 failures [20:27:41] PROBLEM - puppet last run on mw1117 is CRITICAL: CRITICAL: Puppet has 1 failures [20:27:41] PROBLEM - puppet last run on mw1205 is CRITICAL: CRITICAL: Puppet has 1 failures [20:27:41] PROBLEM - puppet last run on mw1189 is CRITICAL: CRITICAL: Puppet has 1 failures [20:27:42] 
PROBLEM - puppet last run on mw1009 is CRITICAL: CRITICAL: Puppet has 1 failures [20:27:51] PROBLEM - puppet last run on mw1068 is CRITICAL: CRITICAL: Puppet has 1 failures [20:27:51] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures [20:27:52] PROBLEM - puppet last run on mw1100 is CRITICAL: CRITICAL: Puppet has 1 failures [20:27:52] PROBLEM - puppet last run on mw1173 is CRITICAL: CRITICAL: Puppet has 1 failures [20:28:01] PROBLEM - puppet last run on mw1217 is CRITICAL: CRITICAL: Puppet has 1 failures [20:28:02] PROBLEM - puppet last run on mw1008 is CRITICAL: CRITICAL: Puppet has 1 failures [20:28:02] PROBLEM - puppet last run on mw1069 is CRITICAL: CRITICAL: Puppet has 1 failures [20:28:11] PROBLEM - puppet last run on mw1228 is CRITICAL: CRITICAL: Puppet has 1 failures [20:28:11] PROBLEM - puppet last run on mw1003 is CRITICAL: CRITICAL: Puppet has 1 failures [20:28:11] PROBLEM - puppet last run on mw1052 is CRITICAL: CRITICAL: Puppet has 1 failures [20:28:12] PROBLEM - puppet last run on mw1222 is CRITICAL: CRITICAL: Puppet has 1 failures [20:28:12] PROBLEM - puppet last run on mw1176 is CRITICAL: CRITICAL: Puppet has 1 failures [20:28:22] PROBLEM - puppet last run on mw1118 is CRITICAL: CRITICAL: Puppet has 1 failures [20:28:32] PROBLEM - puppet last run on mw1235 is CRITICAL: CRITICAL: Puppet has 1 failures [20:28:32] PROBLEM - puppet last run on mw1060 is CRITICAL: CRITICAL: Puppet has 1 failures [20:28:32] PROBLEM - puppet last run on mw1065 is CRITICAL: CRITICAL: Puppet has 1 failures [20:28:32] PROBLEM - puppet last run on mw1150 is CRITICAL: CRITICAL: Puppet has 1 failures [20:28:32] PROBLEM - puppet last run on mw1046 is CRITICAL: CRITICAL: Puppet has 1 failures [20:28:42] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: Puppet has 1 failures [20:28:42] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: Puppet has 1 failures [20:28:42] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures [20:28:42] PROBLEM - puppet last run on mw1153 is CRITICAL: CRITICAL: Puppet has 1 failures [20:28:52] PROBLEM - puppet last run on mw1251 is CRITICAL: CRITICAL: Puppet has 1 failures [20:29:01] PROBLEM - puppet last run on mw1042 is CRITICAL: CRITICAL: Puppet has 1 failures [20:29:11] PROBLEM - puppet last run on mw1144 is CRITICAL: CRITICAL: Puppet has 1 failures [20:29:11] PROBLEM - puppet last run on mw1129 is CRITICAL: CRITICAL: Puppet has 1 failures [20:29:12] PROBLEM - puppet last run on mw1025 is CRITICAL: CRITICAL: Puppet has 1 failures [20:29:12] PROBLEM - puppet last run on mw1126 is CRITICAL: CRITICAL: Puppet has 1 failures [20:29:12] PROBLEM - puppet last run on mw1092 is CRITICAL: CRITICAL: Puppet has 1 failures [20:29:12] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [20:29:22] PROBLEM - puppet last run on mw1175 is CRITICAL: CRITICAL: Puppet has 1 failures [20:29:23] PROBLEM - puppet last run on mw1177 is CRITICAL: CRITICAL: Puppet has 1 failures [20:29:23] PROBLEM - puppet last run on mw1172 is CRITICAL: CRITICAL: Puppet has 1 failures [20:29:23] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures [20:29:31] PROBLEM - puppet last run on mw1114 is CRITICAL: CRITICAL: Puppet has 1 failures [20:29:52] PROBLEM - puppet last run on mw1002 is CRITICAL: CRITICAL: Puppet has 1 failures [20:30:02] PROBLEM - puppet last run on mw1211 is CRITICAL: CRITICAL: Puppet has 1 failures [20:30:02] PROBLEM - puppet last run on mw1213 is CRITICAL: CRITICAL: 
Puppet has 1 failures [20:30:12] PROBLEM - puppet last run on mw1011 is CRITICAL: CRITICAL: Puppet has 1 failures [20:30:21] PROBLEM - puppet last run on mw1208 is CRITICAL: CRITICAL: Puppet has 1 failures [20:30:21] PROBLEM - puppet last run on mw1149 is CRITICAL: CRITICAL: Puppet has 1 failures [20:30:31] PROBLEM - puppet last run on mw1039 is CRITICAL: CRITICAL: Puppet has 1 failures [20:30:31] PROBLEM - puppet last run on mw1054 is CRITICAL: CRITICAL: Puppet has 1 failures [20:30:32] PROBLEM - puppet last run on mw1206 is CRITICAL: CRITICAL: Puppet has 1 failures [20:30:32] PROBLEM - puppet last run on mw1076 is CRITICAL: CRITICAL: Puppet has 1 failures [20:30:32] PROBLEM - puppet last run on mw1237 is CRITICAL: CRITICAL: Puppet has 1 failures [20:30:32] PROBLEM - puppet last run on mw1055 is CRITICAL: CRITICAL: Puppet has 1 failures [20:30:51] PROBLEM - puppet last run on mw1162 is CRITICAL: CRITICAL: Puppet has 1 failures [20:31:02] PROBLEM - puppet last run on mw1195 is CRITICAL: CRITICAL: Puppet has 1 failures [20:31:02] PROBLEM - puppet last run on mw1044 is CRITICAL: CRITICAL: Puppet has 1 failures [20:31:19] PROBLEM - puppet last run on snapshot1002 is CRITICAL: CRITICAL: Puppet has 1 failures [20:31:19] PROBLEM - puppet last run on mw1227 is CRITICAL: CRITICAL: Puppet has 1 failures [20:31:21] PROBLEM - puppet last run on mw1249 is CRITICAL: CRITICAL: Puppet has 1 failures [20:31:22] PROBLEM - puppet last run on mw1180 is CRITICAL: CRITICAL: Puppet has 1 failures [20:31:22] PROBLEM - puppet last run on mw1049 is CRITICAL: CRITICAL: Puppet has 1 failures [20:31:31] PROBLEM - puppet last run on snapshot1001 is CRITICAL: CRITICAL: Puppet has 1 failures [20:31:32] PROBLEM - puppet last run on mw1247 is CRITICAL: CRITICAL: Puppet has 1 failures [20:31:41] PROBLEM - puppet last run on mw1014 is CRITICAL: CRITICAL: Puppet has 1 failures [20:31:41] PROBLEM - puppet last run on mw1051 is CRITICAL: CRITICAL: Puppet has 1 failures [20:31:42] PROBLEM - puppet last run on mw1151 is CRITICAL: CRITICAL: Puppet has 1 failures [20:31:42] PROBLEM - puppet last run on mw1079 is CRITICAL: CRITICAL: Puppet has 1 failures [20:31:42] PROBLEM - puppet last run on mw1111 is CRITICAL: CRITICAL: Puppet has 1 failures [20:31:51] PROBLEM - puppet last run on mw1238 is CRITICAL: CRITICAL: Puppet has 1 failures [20:31:52] PROBLEM - puppet last run on mw1084 is CRITICAL: CRITICAL: Puppet has 1 failures [20:31:52] PROBLEM - puppet last run on mw1133 is CRITICAL: CRITICAL: Puppet has 1 failures [20:31:52] PROBLEM - puppet last run on mw1050 is CRITICAL: CRITICAL: Puppet has 1 failures [20:32:11] PROBLEM - puppet last run on mw1168 is CRITICAL: CRITICAL: Puppet has 1 failures [20:32:21] PROBLEM - puppet last run on mw1098 is CRITICAL: CRITICAL: Puppet has 1 failures [20:32:22] PROBLEM - puppet last run on mw1030 is CRITICAL: CRITICAL: Puppet has 1 failures [20:32:22] PROBLEM - puppet last run on mw1081 is CRITICAL: CRITICAL: Puppet has 1 failures [20:32:22] PROBLEM - puppet last run on mw1258 is CRITICAL: CRITICAL: Puppet has 1 failures [20:32:32] PROBLEM - puppet last run on mw1004 is CRITICAL: CRITICAL: Puppet has 1 failures [20:32:42] PROBLEM - puppet last run on mw1125 is CRITICAL: CRITICAL: Puppet has 1 failures [20:32:42] PROBLEM - puppet last run on snapshot1004 is CRITICAL: CRITICAL: Puppet has 1 failures [20:32:42] PROBLEM - puppet last run on mw1034 is CRITICAL: CRITICAL: Puppet has 1 failures [20:32:42] PROBLEM - puppet last run on mw1202 is CRITICAL: CRITICAL: Puppet has 1 failures [20:32:42] PROBLEM 
- puppet last run on mw1190 is CRITICAL: CRITICAL: Puppet has 1 failures [20:32:43] PROBLEM - puppet last run on mw1159 is CRITICAL: CRITICAL: Puppet has 1 failures [20:32:51] PROBLEM - puppet last run on mw1156 is CRITICAL: CRITICAL: Puppet has 1 failures [20:32:51] PROBLEM - puppet last run on mw1146 is CRITICAL: CRITICAL: Puppet has 1 failures [20:33:01] PROBLEM - puppet last run on mw1057 is CRITICAL: CRITICAL: Puppet has 1 failures [20:33:02] PROBLEM - puppet last run on mw1097 is CRITICAL: CRITICAL: Puppet has 1 failures [20:33:02] PROBLEM - puppet last run on mw1181 is CRITICAL: CRITICAL: Puppet has 1 failures [20:33:11] PROBLEM - puppet last run on mw1056 is CRITICAL: CRITICAL: Puppet has 1 failures [20:33:11] PROBLEM - puppet last run on mw1210 is CRITICAL: CRITICAL: Puppet has 1 failures [20:33:11] PROBLEM - puppet last run on mw1074 is CRITICAL: CRITICAL: Puppet has 1 failures [20:33:21] PROBLEM - puppet last run on mw1165 is CRITICAL: CRITICAL: Puppet has 1 failures [20:33:22] PROBLEM - puppet last run on mw1248 is CRITICAL: CRITICAL: Puppet has 1 failures [20:33:31] PROBLEM - puppet last run on mw1023 is CRITICAL: CRITICAL: Puppet has 1 failures [20:33:32] PROBLEM - puppet last run on mw1053 is CRITICAL: CRITICAL: Puppet has 1 failures [20:33:32] PROBLEM - puppet last run on mw1029 is CRITICAL: CRITICAL: Puppet has 1 failures [20:33:35] PROBLEM - icinga-wm has many failures [20:33:41] PROBLEM - puppet last run on mw1212 is CRITICAL: CRITICAL: Puppet has 1 failures [20:33:52] PROBLEM - puppet last run on mw1163 is CRITICAL: CRITICAL: Puppet has 1 failures [20:33:52] PROBLEM - puppet last run on mw1183 is CRITICAL: CRITICAL: Puppet has 1 failures [20:33:52] PROBLEM - puppet last run on mw1087 is CRITICAL: CRITICAL: Puppet has 1 failures [20:34:02] PROBLEM - puppet last run on mw1188 is CRITICAL: CRITICAL: Puppet has 1 failures [20:34:02] PROBLEM - puppet last run on mw1171 is CRITICAL: CRITICAL: Puppet has 1 failures [20:34:02] PROBLEM - puppet last run on mw1198 is CRITICAL: CRITICAL: Puppet has 1 failures [20:34:22] PROBLEM - puppet last run on mw1116 is CRITICAL: CRITICAL: Puppet has 1 failures [20:34:32] PROBLEM - puppet last run on mw1032 is CRITICAL: CRITICAL: Puppet has 1 failures [20:34:32] PROBLEM - puppet last run on mw1243 is CRITICAL: CRITICAL: Puppet has 1 failures [20:34:42] PROBLEM - puppet last run on virt1000 is CRITICAL: CRITICAL: Puppet has 1 failures [20:34:42] PROBLEM - puppet last run on mw1148 is CRITICAL: CRITICAL: Puppet has 1 failures [20:34:51] PROBLEM - puppet last run on mw1167 is CRITICAL: CRITICAL: Puppet has 1 failures [20:34:51] PROBLEM - puppet last run on mw1239 is CRITICAL: CRITICAL: Puppet has 1 failures [20:34:51] PROBLEM - puppet last run on mw1001 is CRITICAL: CRITICAL: Puppet has 1 failures [20:35:02] PROBLEM - puppet last run on mw1225 is CRITICAL: CRITICAL: Puppet has 1 failures [20:35:12] PROBLEM - puppet last run on mw1022 is CRITICAL: CRITICAL: Puppet has 1 failures [20:35:12] PROBLEM - puppet last run on mw1186 is CRITICAL: CRITICAL: Puppet has 1 failures [20:35:21] PROBLEM - puppet last run on mw1033 is CRITICAL: CRITICAL: Puppet has 1 failures [20:35:22] PROBLEM - puppet last run on mw1024 is CRITICAL: CRITICAL: Puppet has 1 failures [20:35:31] PROBLEM - puppet last run on mw1108 is CRITICAL: CRITICAL: Puppet has 1 failures [20:35:32] PROBLEM - puppet last run on mw1043 is CRITICAL: CRITICAL: Puppet has 1 failures [20:35:32] PROBLEM - puppet last run on mw1201 is CRITICAL: CRITICAL: Puppet has 1 failures [20:35:42] PROBLEM - 
puppet last run on mw1229 is CRITICAL: CRITICAL: Puppet has 1 failures [20:35:43] PROBLEM - puppet last run on mw1077 is CRITICAL: CRITICAL: Puppet has 1 failures [20:35:51] PROBLEM - puppet last run on mw1185 is CRITICAL: CRITICAL: Puppet has 1 failures [20:35:52] PROBLEM - puppet last run on mw1122 is CRITICAL: CRITICAL: Puppet has 1 failures [20:35:52] PROBLEM - puppet last run on mw1139 is CRITICAL: CRITICAL: Puppet has 1 failures [20:35:52] PROBLEM - puppet last run on mw1231 is CRITICAL: CRITICAL: Puppet has 1 failures [20:35:58] so uh what's going on there? [20:36:01] PROBLEM - puppet last run on mw1121 is CRITICAL: CRITICAL: Puppet has 1 failures [20:36:02] PROBLEM - puppet last run on mw1016 is CRITICAL: CRITICAL: Puppet has 1 failures [20:36:11] PROBLEM - puppet last run on mw1209 is CRITICAL: CRITICAL: Puppet has 1 failures [20:36:12] PROBLEM - puppet last run on mw1105 is CRITICAL: CRITICAL: Puppet has 1 failures [20:36:21] PROBLEM - puppet last run on mw1223 is CRITICAL: CRITICAL: Puppet has 1 failures [20:36:22] PROBLEM - puppet last run on mw1219 is CRITICAL: CRITICAL: Puppet has 1 failures [20:36:32] PROBLEM - puppet last run on mw1152 is CRITICAL: CRITICAL: Puppet has 1 failures [20:37:01] PROBLEM - puppet last run on mw1086 is CRITICAL: CRITICAL: Puppet has 1 failures [20:37:02] PROBLEM - puppet last run on mw1220 is CRITICAL: CRITICAL: Puppet has 1 failures [20:37:11] PROBLEM - puppet last run on mw1236 is CRITICAL: CRITICAL: Puppet has 1 failures [20:37:12] PROBLEM - puppet last run on mw1093 is CRITICAL: CRITICAL: Puppet has 1 failures [20:37:12] PROBLEM - puppet last run on mw1091 is CRITICAL: CRITICAL: Puppet has 1 failures [20:37:12] PROBLEM - puppet last run on mw1143 is CRITICAL: CRITICAL: Puppet has 1 failures [20:37:22] PROBLEM - puppet last run on mw1010 is CRITICAL: CRITICAL: Puppet has 1 failures [20:37:22] PROBLEM - puppet last run on mw1071 is CRITICAL: CRITICAL: Puppet has 1 failures [20:37:31] PROBLEM - puppet last run on mw1142 is CRITICAL: CRITICAL: Puppet has 1 failures [20:37:41] PROBLEM - puppet last run on mw1112 is CRITICAL: CRITICAL: Puppet has 1 failures [20:37:41] PROBLEM - puppet last run on mw1255 is CRITICAL: CRITICAL: Puppet has 1 failures [20:37:41] PROBLEM - puppet last run on mw1241 is CRITICAL: CRITICAL: Puppet has 1 failures [20:37:42] PROBLEM - puppet last run on mw1027 is CRITICAL: CRITICAL: Puppet has 1 failures [20:37:51] PROBLEM - puppet last run on mw1090 is CRITICAL: CRITICAL: Puppet has 1 failures [20:37:51] PROBLEM - puppet last run on mw1204 is CRITICAL: CRITICAL: Puppet has 1 failures [20:37:51] PROBLEM - puppet last run on terbium is CRITICAL: CRITICAL: Puppet has 1 failures [20:37:52] PROBLEM - puppet last run on mw1107 is CRITICAL: CRITICAL: Puppet has 1 failures [20:38:01] PROBLEM - puppet last run on mw1066 is CRITICAL: CRITICAL: Puppet has 1 failures [20:38:01] PROBLEM - puppet last run on mw1064 is CRITICAL: CRITICAL: Puppet has 1 failures [20:38:11] PROBLEM - puppet last run on mw1203 is CRITICAL: CRITICAL: Puppet has 1 failures [20:38:21] PROBLEM - puppet last run on mw1215 is CRITICAL: CRITICAL: Puppet has 1 failures [20:38:32] PROBLEM - puppet last run on mw1207 is CRITICAL: CRITICAL: Puppet has 1 failures [20:38:32] PROBLEM - puppet last run on mw1253 is CRITICAL: CRITICAL: Puppet has 1 failures [20:38:41] PROBLEM - puppet last run on mw1193 is CRITICAL: CRITICAL: Puppet has 1 failures [20:38:41] PROBLEM - puppet last run on mw1113 is CRITICAL: CRITICAL: Puppet has 1 failures [20:38:42] PROBLEM - puppet last run 
on mw1131 is CRITICAL: CRITICAL: Puppet has 1 failures [20:38:42] PROBLEM - puppet last run on mw1021 is CRITICAL: CRITICAL: Puppet has 1 failures [20:38:51] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures [20:38:52] PROBLEM - puppet last run on mw1110 is CRITICAL: CRITICAL: Puppet has 1 failures [20:38:52] PROBLEM - puppet last run on mw1154 is CRITICAL: CRITICAL: Puppet has 1 failures [20:39:02] PROBLEM - puppet last run on mw1103 is CRITICAL: CRITICAL: Puppet has 1 failures [20:39:02] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures [20:39:11] PROBLEM - puppet last run on mw1073 is CRITICAL: CRITICAL: Puppet has 1 failures [20:39:11] PROBLEM - puppet last run on mw1037 is CRITICAL: CRITICAL: Puppet has 1 failures [20:39:23] PROBLEM - puppet last run on mw1104 is CRITICAL: CRITICAL: Puppet has 1 failures [20:39:32] !log Scap deployed at a78ddec [20:39:33] PROBLEM - puppet last run on tmh1001 is CRITICAL: CRITICAL: Puppet has 1 failures [20:39:33] PROBLEM - puppet last run on mw1155 is CRITICAL: CRITICAL: Puppet has 1 failures [20:39:38] Logged the message, Master [20:39:42] PROBLEM - puppet last run on mw1194 is CRITICAL: CRITICAL: Puppet has 1 failures [20:39:52] PROBLEM - puppet last run on mw1199 is CRITICAL: CRITICAL: Puppet has 1 failures [20:39:52] PROBLEM - puppet last run on mw1047 is CRITICAL: CRITICAL: Puppet has 1 failures [20:39:52] PROBLEM - puppet last run on mw1018 is CRITICAL: CRITICAL: Puppet has 1 failures [20:40:02] PROBLEM - puppet last run on mw1128 is CRITICAL: CRITICAL: Puppet has 1 failures [20:40:11] PROBLEM - puppet last run on mw1075 is CRITICAL: CRITICAL: Puppet has 1 failures [20:40:12] PROBLEM - puppet last run on mw1019 is CRITICAL: CRITICAL: Puppet has 1 failures [20:40:22] PROBLEM - puppet last run on mw1020 is CRITICAL: CRITICAL: Puppet has 1 failures [20:40:23] PROBLEM - puppet last run on mw1137 is CRITICAL: CRITICAL: Puppet has 1 failures [20:40:32] PROBLEM - puppet last run on mw1078 is CRITICAL: CRITICAL: Puppet has 1 failures [20:40:43] PROBLEM - puppet last run on mw1070 is CRITICAL: CRITICAL: Puppet has 1 failures [20:40:43] PROBLEM - puppet last run on mw1095 is CRITICAL: CRITICAL: Puppet has 1 failures [20:40:43] PROBLEM - puppet last run on mw1015 is CRITICAL: CRITICAL: Puppet has 1 failures [20:40:52] PROBLEM - puppet last run on mw1102 is CRITICAL: CRITICAL: Puppet has 1 failures [20:40:52] PROBLEM - puppet last run on mw1058 is CRITICAL: CRITICAL: Puppet has 1 failures [20:40:52] PROBLEM - puppet last run on mw1230 is CRITICAL: CRITICAL: Puppet has 1 failures [20:41:01] PROBLEM - puppet last run on mw1101 is CRITICAL: CRITICAL: Puppet has 1 failures [20:41:11] RECOVERY - puppet last run on mw1011 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [20:41:12] PROBLEM - puppet last run on mw1191 is CRITICAL: CRITICAL: Puppet has 1 failures [20:41:12] PROBLEM - puppet last run on mw1017 is CRITICAL: CRITICAL: Puppet has 1 failures [20:41:22] PROBLEM - puppet last run on mw1085 is CRITICAL: CRITICAL: Puppet has 1 failures [20:41:22] PROBLEM - puppet last run on mw1157 is CRITICAL: CRITICAL: Puppet has 1 failures [20:41:31] PROBLEM - puppet last run on mw1179 is CRITICAL: CRITICAL: Puppet has 1 failures [20:41:32] PROBLEM - puppet last run on mw1182 is CRITICAL: CRITICAL: Puppet has 1 failures [20:41:52] PROBLEM - puppet last run on mw1169 is CRITICAL: CRITICAL: Puppet has 1 failures [20:42:12] PROBLEM - puppet last run on tmh1002 is CRITICAL: CRITICAL: 
Puppet has 1 failures [20:42:31] PROBLEM - puppet last run on mw1136 is CRITICAL: CRITICAL: Puppet has 1 failures [20:42:41] PROBLEM - Varnishkafka Delivery Errors per minute on cp3022 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [20000.0] [20:42:52] RECOVERY - puppet last run on mw1200 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [20:42:52] RECOVERY - puppet last run on mw1106 is OK: OK: Puppet is currently enabled, last run 0 seconds ago with 0 failures [20:43:02] RECOVERY - puppet last run on mw1041 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [20:43:31] RECOVERY - puppet last run on mw1254 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [20:43:31] RECOVERY - puppet last run on mw1250 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [20:43:31] RECOVERY - puppet last run on mw1140 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [20:43:51] RECOVERY - puppet last run on mw1174 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [20:43:51] RECOVERY - puppet last run on mw1197 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:43:51] RECOVERY - puppet last run on mw1141 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [20:43:52] RECOVERY - puppet last run on mw1157 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [20:43:52] RECOVERY - puppet last run on snapshot1003 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [20:43:52] RECOVERY - puppet last run on mw1145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:44:02] RECOVERY - puppet last run on mw1063 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [20:44:03] RECOVERY - puppet last run on mw1082 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [20:44:21] RECOVERY - puppet last run on mw1224 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [20:44:22] RECOVERY - puppet last run on mw1160 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:44:31] RECOVERY - puppet last run on mw1026 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [20:44:31] RECOVERY - puppet last run on mw1187 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [20:44:42] RECOVERY - puppet last run on mw1068 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [20:44:51] RECOVERY - puppet last run on mw1100 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [20:44:52] RECOVERY - puppet last run on mw1173 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [20:44:52] RECOVERY - puppet last run on mw1217 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [20:45:01] RECOVERY - puppet last run on mw1008 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [20:45:12] RECOVERY - puppet last run on mw1226 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [20:45:12] RECOVERY - puppet last run on mw1228 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [20:45:12] RECOVERY - puppet last run on mw1003 is OK: OK: Puppet is currently enabled, last run 
54 seconds ago with 0 failures [20:45:12] RECOVERY - puppet last run on mw1222 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [20:45:12] RECOVERY - puppet last run on mw1242 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [20:45:13] RECOVERY - puppet last run on mw1176 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [20:45:13] RECOVERY - puppet last run on mw1088 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [20:45:32] RECOVERY - puppet last run on mw1060 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [20:45:32] RECOVERY - puppet last run on mw1150 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [20:45:32] RECOVERY - puppet last run on mw1099 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [20:45:32] RECOVERY - puppet last run on mw1046 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [20:45:41] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [20:45:42] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [20:45:42] RECOVERY - puppet last run on mw1153 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [20:45:51] RECOVERY - puppet last run on mw1164 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [20:45:51] RECOVERY - puppet last run on mw1117 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [20:45:51] RECOVERY - puppet last run on mw1205 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:45:51] RECOVERY - puppet last run on mw1189 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [20:45:52] RECOVERY - puppet last run on mw1009 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [20:46:01] RECOVERY - puppet last run on mw1042 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [20:46:02] RECOVERY - puppet last run on mw1144 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [20:46:11] RECOVERY - puppet last run on mw1025 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [20:46:12] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [20:46:12] RECOVERY - puppet last run on mw1092 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [20:46:21] RECOVERY - puppet last run on mw1069 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:46:22] RECOVERY - puppet last run on mw1052 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [20:46:22] RECOVERY - puppet last run on mw1177 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [20:46:22] RECOVERY - puppet last run on mw1172 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [20:46:22] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [20:46:41] RECOVERY - puppet last run on mw1118 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:46:42] RECOVERY - puppet last run on mw1235 is OK: OK: Puppet is 
currently enabled, last run 41 seconds ago with 0 failures [20:46:51] RECOVERY - puppet last run on mw1065 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:46:51] RECOVERY - puppet last run on mw1123 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [20:46:52] RECOVERY - puppet last run on mw1002 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [20:47:01] RECOVERY - puppet last run on mw1211 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures [20:47:02] RECOVERY - puppet last run on mw1251 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [20:47:02] RECOVERY - puppet last run on mw1213 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [20:47:11] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:47:21] RECOVERY - puppet last run on mw1129 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [20:47:31] RECOVERY - puppet last run on mw1126 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [20:47:31] RECOVERY - puppet last run on mw1039 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:47:31] RECOVERY - puppet last run on mw1054 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [20:47:41] RECOVERY - puppet last run on mw1175 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:47:42] RECOVERY - puppet last run on mw1114 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [20:47:51] RECOVERY - puppet last run on mw1162 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [20:48:02] RECOVERY - puppet last run on mw1195 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [20:48:02] RECOVERY - puppet last run on mw1044 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [20:48:22] RECOVERY - puppet last run on mw1249 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [20:48:31] RECOVERY - puppet last run on mw1049 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [20:48:31] RECOVERY - puppet last run on snapshot1001 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [20:48:31] RECOVERY - puppet last run on mw1208 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures [20:48:32] RECOVERY - puppet last run on mw1149 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [20:48:41] RECOVERY - puppet last run on mw1014 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [20:48:41] RECOVERY - puppet last run on mw1051 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [20:48:42] RECOVERY - puppet last run on mw1206 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:48:42] RECOVERY - puppet last run on mw1076 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:48:42] RECOVERY - puppet last run on mw1237 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:48:51] RECOVERY - puppet last run on mw1055 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:48:51] RECOVERY - puppet last run on mw1238 is 
OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [20:48:52] RECOVERY - puppet last run on mw1084 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [20:48:52] RECOVERY - puppet last run on mw1133 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [20:49:11] RECOVERY - puppet last run on mw1168 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [20:49:22] RECOVERY - puppet last run on mw1098 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [20:49:22] RECOVERY - puppet last run on mw1030 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [20:49:22] RECOVERY - puppet last run on snapshot1002 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [20:49:31] RECOVERY - puppet last run on mw1227 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [20:49:32] RECOVERY - puppet last run on mw1180 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [20:49:41] RECOVERY - puppet last run on mw1125 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:49:42] RECOVERY - puppet last run on mw1034 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [20:49:42] RECOVERY - puppet last run on mw1202 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:49:42] RECOVERY - puppet last run on mw1183 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [20:49:42] RECOVERY - puppet last run on mw1190 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:49:51] RECOVERY - puppet last run on mw1247 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:49:51] RECOVERY - puppet last run on mw1156 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [20:49:52] RECOVERY - puppet last run on mw1146 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [20:49:52] RECOVERY - puppet last run on mw1151 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:49:52] RECOVERY - puppet last run on mw1079 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [20:50:02] RECOVERY - puppet last run on mw1111 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:50:02] RECOVERY - puppet last run on mw1181 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [20:50:11] RECOVERY - puppet last run on mw1056 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [20:50:22] RECOVERY - puppet last run on mw1165 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [20:50:32] RECOVERY - puppet last run on mw1053 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [20:50:32] RECOVERY - puppet last run on mw1029 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [20:50:33] RECOVERY - puppet last run on mw1212 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [20:50:41] RECOVERY - puppet last run on mw1258 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [20:50:41] RECOVERY - puppet last run on mw1081 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [20:50:41] RECOVERY - puppet last 
run on mw1004 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [20:50:52] RECOVERY - puppet last run on snapshot1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:50:52] RECOVERY - puppet last run on mw1163 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [20:51:01] RECOVERY - puppet last run on mw1087 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [20:51:01] RECOVERY - puppet last run on mw1159 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [20:51:02] RECOVERY - puppet last run on mw1188 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [20:51:02] RECOVERY - puppet last run on mw1171 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [20:51:13] RECOVERY - puppet last run on mw1198 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [20:51:13] RECOVERY - puppet last run on mw1057 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [20:51:13] RECOVERY - puppet last run on mw1097 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [20:51:13] RECOVERY - puppet last run on mw1050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:51:21] RECOVERY - puppet last run on mw1210 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [20:51:22] RECOVERY - puppet last run on mw1074 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures [20:51:31] RECOVERY - puppet last run on mw1116 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [20:51:32] RECOVERY - puppet last run on mw1248 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [20:51:41] RECOVERY - puppet last run on mw1023 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:51:51] RECOVERY - puppet last run on mw1239 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [20:52:02] RECOVERY - puppet last run on mw1225 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [20:52:21] RECOVERY - Varnishkafka Delivery Errors per minute on cp3022 is OK: OK: Less than 1.00% above the threshold [0.0] [20:52:31] RECOVERY - puppet last run on mw1108 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [20:52:32] RECOVERY - puppet last run on mw1043 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [20:52:41] RECOVERY - puppet last run on mw1229 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [20:52:42] RECOVERY - puppet last run on mw1032 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [20:52:51] RECOVERY - puppet last run on mw1243 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [20:52:52] RECOVERY - puppet last run on mw1185 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [20:52:52] RECOVERY - puppet last run on virt1000 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [20:53:01] RECOVERY - puppet last run on mw1148 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:53:02] RECOVERY - puppet last run on mw1121 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [20:53:02] 
RECOVERY - puppet last run on mw1167 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [20:53:02] RECOVERY - puppet last run on mw1001 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [20:53:13] RECOVERY - puppet last run on mw1105 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [20:53:31] RECOVERY - puppet last run on mw1022 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [20:53:31] RECOVERY - puppet last run on mw1186 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:53:32] RECOVERY - puppet last run on mw1033 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [20:53:32] RECOVERY - puppet last run on mw1024 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:53:52] RECOVERY - puppet last run on mw1201 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [20:54:01] RECOVERY - puppet last run on mw1077 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:54:11] RECOVERY - puppet last run on mw1122 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:54:12] RECOVERY - puppet last run on mw1016 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures [20:54:12] RECOVERY - puppet last run on mw1093 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [20:54:13] RECOVERY - puppet last run on mw1091 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [20:54:21] RECOVERY - puppet last run on mw1209 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [20:54:22] RECOVERY - puppet last run on mw1010 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [20:54:31] RECOVERY - puppet last run on mw1142 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures [20:54:32] RECOVERY - puppet last run on mw1223 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [20:54:41] RECOVERY - puppet last run on mw1219 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [20:54:41] RECOVERY - puppet last run on mw1241 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [20:54:42] RECOVERY - puppet last run on mw1152 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [20:55:11] PROBLEM - Varnishkafka Delivery Errors per minute on cp3019 is CRITICAL: CRITICAL: 37.50% of data above the critical threshold [20000.0] [20:55:11] RECOVERY - puppet last run on mw1203 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [20:55:12] RECOVERY - puppet last run on mw1086 is OK: OK: Puppet is currently enabled, last run 3 seconds ago with 0 failures [20:55:12] RECOVERY - puppet last run on mw1220 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [20:55:12] RECOVERY - puppet last run on mw1215 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [20:55:22] RECOVERY - puppet last run on mw1236 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [20:55:22] RECOVERY - puppet last run on mw1139 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:55:22] RECOVERY - puppet last run on mw1231 is OK: OK: Puppet is currently enabled, last run 1 
minute ago with 0 failures [20:55:23] RECOVERY - puppet last run on mw1143 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [20:55:31] RECOVERY - puppet last run on mw1253 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [20:55:32] RECOVERY - puppet last run on mw1193 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [20:55:32] RECOVERY - puppet last run on mw1071 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [20:55:51] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [20:55:51] RECOVERY - puppet last run on mw1112 is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [20:55:51] RECOVERY - puppet last run on mw1255 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [20:55:52] RECOVERY - puppet last run on mw1027 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [20:56:01] RECOVERY - puppet last run on mw1090 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [20:56:01] RECOVERY - puppet last run on mw1204 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures [20:56:01] RECOVERY - puppet last run on terbium is OK: OK: Puppet is currently enabled, last run 43 seconds ago with 0 failures [20:56:02] RECOVERY - puppet last run on mw1107 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [20:56:11] RECOVERY - puppet last run on mw1066 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [20:56:11] RECOVERY - puppet last run on mw1064 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:56:21] RECOVERY - puppet last run on mw1104 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [20:56:32] RECOVERY - puppet last run on tmh1001 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [20:56:32] RECOVERY - puppet last run on mw1155 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [20:56:41] RECOVERY - puppet last run on mw1207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:56:41] RECOVERY - puppet last run on mw1199 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [20:56:42] RECOVERY - puppet last run on mw1047 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [20:56:51] RECOVERY - puppet last run on mw1113 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures [20:56:52] RECOVERY - puppet last run on mw1131 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:56:52] RECOVERY - puppet last run on mw1021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:57:02] RECOVERY - puppet last run on mw1110 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:57:11] RECOVERY - puppet last run on mw1154 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:57:12] RECOVERY - puppet last run on mw1103 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [20:57:12] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [20:57:12] RECOVERY - puppet last run on mw1073 is OK: OK: Puppet is currently 
enabled, last run 33 seconds ago with 0 failures [20:57:12] RECOVERY - puppet last run on mw1037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:57:32] RECOVERY - puppet last run on mw1070 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [20:57:42] RECOVERY - puppet last run on mw1058 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [20:57:42] RECOVERY - puppet last run on mw1230 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [20:57:51] RECOVERY - puppet last run on mw1194 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [20:58:01] RECOVERY - puppet last run on mw1018 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:58:02] RECOVERY - puppet last run on mw1017 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [20:58:02] RECOVERY - puppet last run on mw1128 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:58:12] RECOVERY - puppet last run on mw1075 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [20:58:21] RECOVERY - puppet last run on mw1085 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [20:58:21] RECOVERY - puppet last run on mw1019 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [20:58:22] RECOVERY - puppet last run on mw1179 is OK: OK: Puppet is currently enabled, last run 24 seconds ago with 0 failures [20:58:22] RECOVERY - puppet last run on mw1020 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [20:58:22] RECOVERY - puppet last run on mw1182 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [20:58:22] RECOVERY - puppet last run on mw1137 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:58:32] RECOVERY - puppet last run on mw1078 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [20:58:51] RECOVERY - puppet last run on mw1095 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [20:58:51] RECOVERY - puppet last run on mw1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:58:52] RECOVERY - puppet last run on mw1102 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:59:02] RECOVERY - puppet last run on mw1101 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [20:59:02] RECOVERY - puppet last run on tmh1002 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [20:59:21] RECOVERY - puppet last run on mw1191 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [20:59:22] RECOVERY - puppet last run on mw1136 is OK: OK: Puppet is currently enabled, last run 51 seconds ago with 0 failures [21:00:06] are we done yet icinga-wm? 
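The sshkey exchange at 19:54-19:55 above (before the puppet alert flood) boils down to two incompatible uniqueness rules: ssh's known_hosts treats the pair (hostname, key type) as the unique key, while Puppet's sshkey type uses the hostname alone as its namevar and rejects a second resource with the same title. The following is a minimal Python sketch of that mismatch only; it is not Puppet code and is not taken from operations/puppet, and the hostname and key strings are made-up placeholders.

    # ssh semantics: one entry per (hostname, key type) pair is fine.
    known_hosts = {}
    # Puppet sshkey semantics as described above: the resource title, i.e. the
    # hostname, must be unique across the whole catalogue.
    catalogue = {}

    def add_known_host(host, key_type, key):
        known_hosts[(host, key_type)] = key

    def add_sshkey_resource(title, key_type, key):
        if title in catalogue:
            # The "Duplicate definition" parser error, in Puppet's wording.
            raise ValueError(
                f"Duplicate definition: Sshkey[{title}] is already defined")
        catalogue[title] = (key_type, key)

    add_known_host("deployhost.example.net", "ssh-rsa", "AAAA...rsa")
    add_known_host("deployhost.example.net", "ssh-ed25519", "AAAA...ed25519")  # accepted

    add_sshkey_resource("deployhost.example.net", "ssh-rsa", "AAAA...rsa")
    try:
        add_sshkey_resource("deployhost.example.net", "ssh-ed25519", "AAAA...ed25519")
    except ValueError as e:
        print(e)  # the second key type cannot be declared for the same title

Running the sketch prints the duplicate-definition error for the second call, mirroring the parser failure described at 19:55: only one key type per host can be expressed through that resource title.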
[21:00:11] RECOVERY - puppet last run on mw1169 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:00:14] no [21:00:16] :) [21:00:27] !log reedy Synchronized wmf-config/InitialiseSettings.php: nooop to test scap update (duration: 00m 06s) [21:00:32] Logged the message, Master [21:01:21] Reedy: almost [21:02:31] RECOVERY - Varnishkafka Delivery Errors per minute on cp3019 is OK: OK: Less than 1.00% above the threshold [0.0] [21:04:19] !log reedy Synchronized wmf-config/InitialiseSettings.php: nooop to test scap update (duration: 00m 09s) [21:04:22] Logged the message, Master [21:06:02] 3WMF-Legal, operations, Engineering-Community: Implement the Volunteer NDA process in Phabricator - https://phabricator.wikimedia.org/T655#964236 (10Qgil) @chasemp, did you see the process described at https://wikitech.wikimedia.org/wiki/Talk:Volunteer_NDA ? Any reason to bypass that? [21:10:02] 3operations: Determine Trebuchet/git-deploy maintenance plan - https://phabricator.wikimedia.org/T85008#964243 (10Ryan_Lane) I'm more than happy to add WMF as maintainers. It's not unmaintained, but no one has been bugging me about changes/fixes (I made a fix a month ago or so on request). It's silly to start a... [21:10:27] ^d: andre__ : in your role as wikitech-l admins, you could help on T1003, by adding an email address to the allowed senders [21:10:48] that would let the list receive community metrics mail.. phab stats [21:12:20] mutante, what to paste under "List of non-member addresses whose postings should be automatically accepted." ? [21:12:25] 3WMF-Legal, operations, Engineering-Community: Implement the Volunteer NDA process in Phabricator - https://phabricator.wikimedia.org/T655#964250 (10chasemp) >>! In T655#964236, @Qgil wrote: > @chasemp, did you see the process described at https://wikitech.wikimedia.org/wiki/Talk:Volunteer_NDA ? Any reason to by... [21:12:38] andre__: communitymetrics@wikimedia.org [21:13:28] the sender address of that script (configured in puppet phab role) [21:13:30] mutante, added [21:13:42] cool, then the list should receive stats on Feb 1st [21:13:44] thx [21:13:52] as requested by Nemo and qgil [21:17:52] <^d> andre__: I wonder how that list got in its current state. [21:18:00] <^d> There's a ton of random posters in there. [21:19:00] ^d: I think you can add someone to the whitelist if you OK their email after it's held for moderation [21:19:27] (03PS1) 10Faidon Liambotis: ssh: remove unused variable _authorized_keys_file2 [puppet] - 10https://gerrit.wikimedia.org/r/183638 [21:19:49] ^d, I don't know either :-/ [21:19:52] lots of folks, yeah [21:19:59] <^d> That's an *ancient* list [21:20:13] (03CR) 10Faidon Liambotis: [C: 032] ssh: remove unused variable _authorized_keys_file2 [puppet] - 10https://gerrit.wikimedia.org/r/183638 (owner: 10Faidon Liambotis) [21:20:26] thanks mutante [21:25:19] (03CR) 10Dzahn: "@Hashar: I haven't used multiple channels for one logfile yet, but i found this example in modules/ircecho/manifests/init.pp. So yes, that" [puppet] - 10https://gerrit.wikimedia.org/r/183382 (https://phabricator.wikimedia.org/T86053) (owner: 10Hashar) [21:26:56] ^d: andre__ what valhallasw`cloud said. it's most likely that when somebody moderated pending mails they hit enter and the default is a checkbox is enabled that adds the senders permanently [21:27:09] vs. 
just for that specific mail [21:27:44] Nemo_bis: yw [21:28:33] In the mailman installs I remember, the checkbox was not enabled by default [21:28:49] Doesn't really matter though, after many years of operation it will be a mess :) [21:29:13] !log reedy Synchronized wmf-config/InitialiseSettings.php: nooop to test scap update (duration: 00m 06s) [21:29:22] true. and you could argue why they can't just subscribe like a regular user [21:29:43] mutante: or argue why not just trust them if they have shown to be good ;-) [21:30:11] bblack: thoughts re: the vcl_hash amendment to the x-wikimedia-debug patch? [21:31:00] valhallasw`cloud: or that:) yes [21:31:05] (03CR) 10Hoo man: Don't use logrotate for the wikidata dump logs (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/182173 (owner: 10Hoo man) [21:32:06] (03CR) 10Hoo man: "@Filippo: Yes, I guess making this more atomic is something we should consider for the future (and also make this smarted about catching f" [puppet] - 10https://gerrit.wikimedia.org/r/182173 (owner: 10Hoo man) [21:32:35] (03CR) 10Dzahn: [C: 031] Duplicate -qa notifcations to -releng [puppet] - 10https://gerrit.wikimedia.org/r/183382 (https://phabricator.wikimedia.org/T86053) (owner: 10Hashar) [21:33:13] (03PS2) 10Hoo man: Don't use logrotate for the wikidata dump logs [puppet] - 10https://gerrit.wikimedia.org/r/182173 [21:33:15] (03CR) 10Dzahn: "merge already?" [puppet] - 10https://gerrit.wikimedia.org/r/183382 (https://phabricator.wikimedia.org/T86053) (owner: 10Hashar) [21:36:05] ori: I hadn't considered conditional hash_data() before. it's possible that works (and if it does, then I don't think we even need the Vary part anymore, either). [21:36:25] ori: but let me get through with some other stuff and I'll go back through it all again with fresh eyes later today or early tomorrow [21:36:31] bblack: conditional hash_data() is what the default vcl_hash does, too [21:36:48] I mean, whether it works out in this particular case :) [21:36:55] https://www.varnish-cache.org/trac/browser/bin/varnishd/default.vcl?rev=3.0#L84 [21:36:57] right. 
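For context on the vcl_hash exchange just above: the default.vcl linked at 21:36 hashes req.url unconditionally and req.http.host only when it is set, so adding another header to the hash behind an if-guard follows the same pattern, and requests that lack the header keep sharing one cache object. Below is a rough Python sketch of that idea only; it is not VCL and not the actual X-Wikimedia-Debug patch, and the URL, host, and header values are made up.

    import hashlib

    def cache_key(url, host=None, debug=None):
        # Same shape as Varnish's default vcl_hash: always hash the URL, and
        # fold in optional inputs only when they are actually present.
        h = hashlib.sha256()
        h.update(url.encode())               # hash_data(req.url)
        if host is not None:
            h.update(host.encode())          # if (req.http.host) { hash_data(...) }
        if debug is not None:
            h.update(debug.encode())         # conditional hash_data() on the debug header
        return h.hexdigest()

    # Ordinary requests share one object; requests carrying the debug header
    # get a separate one, with no Vary header needed to keep them apart.
    print(cache_key("/wiki/Main_Page", host="en.wikipedia.org"))
    print(cache_key("/wiki/Main_Page", host="en.wikipedia.org", debug="1"))

If keying the cache this way works, a Vary-based workaround is no longer needed for correctness, which is the point made in the surrounding discussion.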
[21:37:03] but like I said, we can still kill the Vary part if it does [21:37:13] right [21:39:28] 3Wikimedia-Git-or-Gerrit, Code-Review, operations: Chrome warns about insecure certificate on gerrit.wikimedia.org - https://phabricator.wikimedia.org/T76562#964321 (10RobH) a:3RobH [21:39:55] 3Wikimedia-Git-or-Gerrit, Code-Review, operations: Chrome warns about insecure certificate on gerrit.wikimedia.org - https://phabricator.wikimedia.org/T76562#805957 (10RobH) I'll test this out with setting up the misc web to handle it and locally hacking my /etc/hosts, not changing dns [21:40:23] (03CR) 10Dzahn: [C: 031] "i'm not a shinken expert yet but it looks so similar to the existing check above, hard to imagine why it wouldn't just work" [puppet] - 10https://gerrit.wikimedia.org/r/183454 (https://phabricator.wikimedia.org/T54867) (owner: 10Hashar) [21:44:30] (03PS1) 10RobH: setting up no caching on gerrit behind misc-web [puppet] - 10https://gerrit.wikimedia.org/r/183670 [21:47:05] salt: command not found [21:47:11] on terbium [21:47:25] did you mean: Command 'salat' from package 'fortunes-de' (universe) [21:47:33] hahaha :D [21:49:12] hoo: :) [21:50:29] 3operations, Release-Engineering: Determine Trebuchet/git-deploy maintenance plan - https://phabricator.wikimedia.org/T85008#964352 (10greg) [21:51:17] also The program 'dsh' is currently not installed [21:51:19] what the heck [21:51:29] dsh is definitely on tin [21:52:39] !log fixing scap permissions on mediawiki-installation servers via dsh [21:52:45] Logged the message, Master [21:52:49] Reedy: yes, ^ as requested [21:52:54] (03CR) 10BBlack: [C: 031] "Seems legit for testing the concept" [puppet] - 10https://gerrit.wikimedia.org/r/183670 (owner: 10RobH) [21:53:06] whee [21:53:17] !log reedy Synchronized wmf-config/InitialiseSettings.php: nooop to test scap update (duration: 00m 06s) [21:53:24] Reedy: where can we log the bug? [21:53:38] (03CR) 10RobH: [C: 031] "I think this is good to go, and since DNS isn't changing, it won't change gerrit use for users with only this patchset." [puppet] - 10https://gerrit.wikimedia.org/r/183670 (owner: 10RobH) [21:54:03] (03CR) 10RobH: [C: 032] setting up no caching on gerrit behind misc-web [puppet] - 10https://gerrit.wikimedia.org/r/183670 (owner: 10RobH) [21:54:23] (03CR) 10Ottomata: "one comment, aside from that LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/170668 (owner: 10QChris) [21:54:40] robh: wait [21:55:26] mutante: ... shit [21:55:36] its merged and already running on one,whats wrong? 
[21:56:01] ytterbium already has a definition to use port 8080 [21:57:13] (03PS6) 10QChris: Install maven on analytics clients [puppet] - 10https://gerrit.wikimedia.org/r/170668 [22:04:40] 3operations: replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#964407 (10Chmarkine) [22:04:43] (03PS7) 10Ottomata: Install maven on analytics clients [puppet] - 10https://gerrit.wikimedia.org/r/170668 (owner: 10QChris) [22:04:51] (03CR) 10Ottomata: [C: 032 V: 032] Install maven on analytics clients [puppet] - 10https://gerrit.wikimedia.org/r/170668 (owner: 10QChris) [22:08:35] (03PS1) 10Ottomata: Move maven init.pp into proper directory [puppet] - 10https://gerrit.wikimedia.org/r/183706 [22:08:42] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: puppet fail [22:09:17] (03CR) 10jenkins-bot: [V: 04-1] Move maven init.pp into proper directory [puppet] - 10https://gerrit.wikimedia.org/r/183706 (owner: 10Ottomata) [22:10:46] (03PS2) 10Ottomata: Move maven init.pp into proper directory [puppet] - 10https://gerrit.wikimedia.org/r/183706 [22:12:00] (03CR) 10Ottomata: [C: 032] Move maven init.pp into proper directory [puppet] - 10https://gerrit.wikimedia.org/r/183706 (owner: 10Ottomata) [22:13:29] (03PS1) 10Ottomata: Set proper source file path for maven settings.xml [puppet] - 10https://gerrit.wikimedia.org/r/183707 [22:14:16] (03CR) 10Ottomata: [C: 032] Set proper source file path for maven settings.xml [puppet] - 10https://gerrit.wikimedia.org/r/183707 (owner: 10Ottomata) [22:14:18] (03PS4) 10Reedy: Added ang.wikibooks and ie.wikibooks to closed.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180451 (https://phabricator.wikimedia.org/T78667) (owner: 10Dzahn) [22:14:25] (03CR) 10Reedy: [C: 031] Added ang.wikibooks and ie.wikibooks to closed.dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/180451 (https://phabricator.wikimedia.org/T78667) (owner: 10Dzahn) [22:16:32] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: puppet fail [22:17:02] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [22:18:11] (03PS1) 10Ottomata: Install jq on Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/183710 [22:18:41] (03PS2) 10Ottomata: Install jq on Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/183710 [22:19:26] (03CR) 10jenkins-bot: [V: 04-1] Install jq on Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/183710 (owner: 10Ottomata) [22:20:11] (03PS3) 10Ottomata: Install jq on Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/183710 [22:20:12] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [22:20:44] 3operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#964481 (10faidon) p:5Normal>3High a:3RobH [22:20:53] robh: I assigned this to you [22:21:12] it's becoming increasingly important I think [22:21:18] (03PS4) 10Ottomata: Install jq on Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/183710 [22:21:26] i have the rt ticket too but phab is better [22:21:30] (the rt was in procurement) [22:21:45] yeah this is mostly procurement, but since there is a phab already and users are commenting on it... 
[22:22:05] I used phab's [ ]; you can edit the task and make it [x] and it will check the checkbox
[22:22:07] (03CR) 10Ottomata: [C: 032] Install jq on Hadoop worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/183710 (owner: 10Ottomata)
[22:22:24] there's also a separate task for gerrit, I moved it into a subtask of this one
[22:22:32] really? just in the text paravoid?
[22:22:42] das purty cool
[22:22:45] yeah, like github
[22:24:40] oh, didn't know github did that either
[22:25:48] yeah, see e.g. https://github.com/facebook/hhvm/issues/1480
[22:25:53] off the top of my head
[22:27:02] PROBLEM - Varnishkafka Delivery Errors per minute on cp3019 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [20000.0]
[22:32:42] PROBLEM - Varnishkafka Delivery Errors per minute on cp3004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [20000.0]
[22:32:42] PROBLEM - Varnishkafka Delivery Errors per minute on cp3005 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [20000.0]
[22:32:42] PROBLEM - Varnishkafka Delivery Errors per minute on cp3022 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [20000.0]
[22:32:52] PROBLEM - Varnishkafka Delivery Errors per minute on cp3015 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [20000.0]
[22:34:21] RECOVERY - Varnishkafka Delivery Errors per minute on cp3019 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:34:49] ottomata: 11.11% 22.22% 33.33%?
[22:34:51] wtf? :)
[22:35:41] hm.
[22:35:49] different boxes too.
[22:36:59] paravoid, ottomata: 1/9, 2/9, 3/9
[22:38:49] hm, there are 10 datapoints in each interval that it is looking at
[22:39:14] or, maybe 9, it is looking at the last 10 minutes
[22:39:18] 1 per minute
[22:40:02] RECOVERY - Varnishkafka Delivery Errors per minute on cp3004 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:40:02] PROBLEM - Varnishkafka Delivery Errors per minute on cp3021 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [20000.0]
[22:40:22] RECOVERY - Varnishkafka Delivery Errors per minute on cp3015 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:41:12] RECOVERY - Varnishkafka Delivery Errors per minute on cp3005 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:41:13] 3Wikimedia-Git-or-Gerrit, Code-Review, operations: Chrome warns about insecure certificate on gerrit.wikimedia.org - https://phabricator.wikimedia.org/T76562#964609 (10RobH) So this won't work behind misc-web-lb until after the gerrit dependency on using sshd on the same interface is happening (since misc-web-lb...
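RobH's test plan earlier in the log (point gerrit.wikimedia.org at misc-web locally instead of changing DNS) can also be run without editing /etc/hosts, since curl lets you pin the resolution per request. A rough sketch; the address is a placeholder because the misc-web-lb IP is not given in the log:

```bash
# Send a request for gerrit.wikimedia.org to misc-web-lb without touching DNS.
# 192.0.2.1 is a documentation placeholder; substitute the real misc-web-lb address.
curl --resolve gerrit.wikimedia.org:443:192.0.2.1 -sI https://gerrit.wikimedia.org/ \
  | grep -iE '^(HTTP|x-cache|age|cache-control)'
# With the "no caching on gerrit behind misc-web" change (r183670) in place, repeat
# requests should report cache misses rather than hits.
```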
[22:41:21] RECOVERY - Varnishkafka Delivery Errors per minute on cp3022 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:42:44] 3ops-requests: set up DMARC aggregate report collection into a database for research and reporting - https://phabricator.wikimedia.org/T86209#964613 (10Aklapper)
[22:45:21] PROBLEM - Varnishkafka Delivery Errors per minute on cp3003 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [20000.0]
[22:47:21] RECOVERY - Varnishkafka Delivery Errors per minute on cp3021 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:50:33] (03CR) 10GWicke: [C: 031] Temporarily add Elasticsearch to einsteinium [puppet] - 10https://gerrit.wikimedia.org/r/181612 (owner: 10Manybubbles)
[22:51:12] RECOVERY - Varnishkafka Delivery Errors per minute on cp3003 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:56:47] 3operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#964633 (10RobH) There is an embarrassing # of them in SHA1. As revoking and reissue via RapidSSL is very difficult (they have to cancel the order entirely to revoke, but then it takes multiple calls with support to get...
[22:57:24] 3operations: Replace SHA1 certificates with SHA256 - https://phabricator.wikimedia.org/T73156#964634 (10RobH) well, some of those are actually going away and I meant to pull them before pasting: virt0.wikimedia.org, virt1000.wikimedia.org, unified.wikimedia.org
[23:21:55] (03PS2) 10Ori.livneh: memcached: use a unix socket instead of a tcp connection on selected hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183526 (owner: 10Giuseppe Lavagetto)
[23:22:23] 3Wikimedia-General-or-Unknown, WMF-Legal, operations: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270#964695 (10LuisV_WMF) For simple puppet confs, I'd recommend CC0 (mostly they are so factual it is unlikely there is much that is copyrightable in them anyway). However, since th...
[23:23:51] (03PS3) 10Ori.livneh: memcached: use a unix socket instead of a tcp connection on selected hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183526 (owner: 10Giuseppe Lavagetto)
[23:24:21] (03CR) 10Ori.livneh: [C: 032] memcached: use a unix socket instead of a tcp connection on selected hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183526 (owner: 10Giuseppe Lavagetto)
[23:26:50] !log depooling mw1230 and mw1231 for a couple of minutes for I4c4691e26
[23:26:57] Logged the message, Master
[23:28:31] PROBLEM - Varnishkafka Delivery Errors per minute on cp3019 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [20000.0]
[23:29:07] (03Merged) 10jenkins-bot: memcached: use a unix socket instead of a tcp connection on selected hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183526 (owner: 10Giuseppe Lavagetto)
[23:30:08] !log ori Synchronized wmf-config/mc.php: I4c4691e26: memcached: use a unix socket instead of a tcp connection on selected hosts (duration: 00m 06s)
[23:30:11] Logged the message, Master
[23:31:51] ori: revert
[23:31:56] people logged out
[23:32:11] ori: ^
[23:32:15] gadgets broken
[23:32:17] ok
[23:32:47] !log ori Synchronized wmf-config/mc.php: Revert: I4c4691e26: memcached: use a unix socket instead of a tcp connection on selected hosts (duration: 00m 06s)
[23:32:50] Logged the message, Master
[23:33:01] reverted with a local edit, now let me see what went wrong
[23:33:11] ok
[23:33:33] * ^d gives ori his gold star for the week :)
[23:33:52] ori: I see it
[23:34:13] (03CR) 10Hoo man: memcached: use a unix socket instead of a tcp connection on selected hosts (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183526 (owner: 10Giuseppe Lavagetto)
[23:34:26] yeah.
[23:34:31] RECOVERY - Varnishkafka Delivery Errors per minute on cp3019 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:34:34] thanks.
[23:34:51] So, it may be nothing, but I think my account may be compromised.
[23:35:06] BDD_: Have you just been logged out? and now logged in again?
[23:35:11] That's known
[23:35:18] I was just logged out, yeah, and now my password won't work.
[23:35:28] Does it still not work?
[23:35:30] Hello, I'm trying to report a deletion of hate speech on one of Ryulong's talk pages, which is absent from their edit history. Is this the right place? Also, I wasn't able to log onto my account for about a week during my first ANI discussions several months ago.
[23:35:49] Ah, hold on--previously I got an error when I did a forgotten password thing. I'll try that again.
[23:36:25] (03PS1) 10Ori.livneh: Fix for I4c4691e26: make 'servers' value an array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183728
[23:36:32] hoo: ^
[23:36:38] (03CR) 10jenkins-bot: [V: 04-1] Fix for I4c4691e26: make 'servers' value an array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183728 (owner: 10Ori.livneh)
[23:36:43] (03PS2) 10Hoo man: Fix for I4c4691e26: make 'servers' value an array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183728 (owner: 10Ori.livneh)
[23:36:57] PS1 that did get a code-review was correct, though
[23:37:05] (03CR) 10Hoo man: [C: 031] Fix for I4c4691e26: make 'servers' value an array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183728 (owner: 10Ori.livneh)
[23:37:10] Alright, I'm back. False alarm, I guess.
[23:37:29] paravoid: yes, my cosmetic follow-up was neither correct nor cosmetic
[23:37:35] heh
[23:37:52] I strongly suspect David Gerard was involved as not many people could delete edits from the record; this incident is what prompted Ryulong to revert themselves in allowing the category size change to be added to Super Sentei. Now David Gerard is edit warring on another Wiki saying slavery isn't historically important to the U.S. Presidents; yet it is for Muhammad.
[23:37:59] postmortem-worthy?
[23:38:04] nah
[23:38:12] CensoredScribe: You're in the wrong channel.
[23:38:17] CensoredScribe: it might be better to make a ticket in OTRS for that
[23:38:32] (03CR) 10Ori.livneh: [C: 032] Fix for I4c4691e26: make 'servers' value an array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183728 (owner: 10Ori.livneh)
[23:38:34] CensoredScribe: You probably want #wikipedia-en-unblock or #wikipedia-en or somewhere else. This channel is about server operations.
[23:38:49] CensoredScribe: ^ that or try #wikimedia-otrs
[23:39:02] Cool, thank you very much.
[23:39:08] (03Merged) 10jenkins-bot: Fix for I4c4691e26: make 'servers' value an array [mediawiki-config] - 10https://gerrit.wikimedia.org/r/183728 (owner: 10Ori.livneh)
[23:42:09] hoo: thanks again
[23:42:47] you're welcome
[23:46:53] !log ori Synchronized wmf-config/mc.php: (no message) (duration: 00m 07s)
[23:46:56] Logged the message, Master
[23:49:50] (03PS1) 10Milimetric: Add job to crunch Language team data [puppet] - 10https://gerrit.wikimedia.org/r/183734
[23:59:23] paravoid: nutcracker's socket file permission is 0755; it needs to be writable by apache to be usable
[23:59:38] paravoid: how should that be done? via --umask arg for start-stop-daemon in the init script?
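On that last question: start-stop-daemon does take a --umask (-k) option, so relaxing the umask in the init script is one plausible route to a group-writable socket; the other common approach is a post-start chgrp/chmod on the socket. A rough sketch of the first option, assuming illustrative paths and that apache shares a group with nutcracker (neither of which is confirmed in the log):

```bash
# Inside the init script's start stanza: launch nutcracker with a 0002 umask so the
# unix socket it creates comes out group-writable (0775 instead of the current 0755).
# Paths, user and group names are illustrative, not the production init script.
start-stop-daemon --start --quiet --umask 0002 \
    --pidfile /var/run/nutcracker/nutcracker.pid \
    --exec /usr/sbin/nutcracker -- \
    --conf-file /etc/nutcracker/nutcracker.yml \
    --pid-file /var/run/nutcracker/nutcracker.pid \
    --daemonize

# For apache to actually write to the socket it still needs to be in the socket's
# group, e.g. (hypothetical group name): usermod -a -G nutcracker www-data
```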