[00:04:08] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0]
[00:11:32] !log catrope Synchronized php-1.26wmf3/extensions/WikimediaEvents/: SWAT (duration: 00m 45s)
[00:11:38] Logged the message, Master
[00:12:54] !log catrope Synchronized php-1.26wmf2/extensions/Flow: SWAT (duration: 00m 41s)
[00:12:57] Logged the message, Master
[00:13:22] !log catrope Synchronized php-1.26wmf3/extensions/Flow: SWAT (duration: 00m 28s)
[00:13:26] Logged the message, Master
[00:16:57] (03PS3) 10GWicke: Load HTML directly from RESTBase for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206319 (https://phabricator.wikimedia.org/T95229)
[00:17:08] (03PS3) 10GWicke: Load HTML directly from RESTBase on all wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206320 (https://phabricator.wikimedia.org/T95229)
[00:18:57] PROBLEM - HTTP error ratio anomaly detection on graphite1001 is CRITICAL Anomaly detected: 10 data above and 8 below the confidence bounds
[00:24:53] (03PS1) 10Ori.livneh: varnish: set do_gzip to true for mobile [puppet] - 10https://gerrit.wikimedia.org/r/207003
[00:24:57] ^ bblack
[00:25:04] (03CR) 10Jforrester: [C: 031] Load HTML directly from RESTBase for enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206319 (https://phabricator.wikimedia.org/T95229) (owner: 10GWicke)
[00:28:18] (03CR) 10BBlack: [C: 032] varnish: set do_gzip to true for mobile [puppet] - 10https://gerrit.wikimedia.org/r/207003 (owner: 10Ori.livneh)
[00:32:30] (03CR) 10Jforrester: Use /api/rest_v1/ entry point for VE, take two. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206502 (owner: 10GWicke)
[00:34:18] (03PS1) 10Jforrester: Follow-up eca4279: Unbreak VisualEditor loading in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207006
[00:35:02] (03CR) 10Catrope: [C: 032] Follow-up eca4279: Unbreak VisualEditor loading in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207006 (owner: 10Jforrester)
[00:35:07] (03Merged) 10jenkins-bot: Follow-up eca4279: Unbreak VisualEditor loading in Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207006 (owner: 10Jforrester)
[00:37:21] !log rebooting cp3030
[00:37:27] Logged the message, Master
[00:49:08] RECOVERY - HTTP error ratio anomaly detection on graphite1001 is OK No anomaly detected
[00:54:33] (03CR) 10GWicke: "It looks like you rewrote much of dump_restbase in python ;) I thought that we'd just create a script to set up the directories & do some " (032 comments) [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/206849 (owner: 10ArielGlenn)
[01:07:27] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 728.845692291
[01:12:28] PROBLEM - Varnishkafka Delivery Errors per minute on cp4007 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[01:15:48] RECOVERY - Varnishkafka Delivery Errors per minute on cp4007 is OK Less than 1.00% above the threshold [0.0]
[01:26:37] PROBLEM - puppet last run on analytics1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:26:38] PROBLEM - RAID on analytics1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:27:37] PROBLEM - dhclient process on analytics1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:27:37] PROBLEM - Hadoop DataNode on analytics1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
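[Editor's note: the "confidence bounds" anomaly alerts above ("10 data above and 8 below the confidence bounds") flag a metric when too many recent points fall outside a band around the baseline. A minimal sketch of that idea, assuming a simple static mean ± k·stdev band — the production graphite check is more elaborate (it forecasts the bounds rather than using a fixed window), and all numbers below are illustrative:]

```python
from statistics import mean, stdev

def anomalies(baseline, recent, k=3.0):
    """Count recent points above/below the baseline's mean +/- k*stdev band.

    Hedged illustration only: the real check forecasts the confidence
    bounds instead of deriving them from one static window.
    """
    mu, sd = mean(baseline), stdev(baseline)
    lo, hi = mu - k * sd, mu + k * sd
    above = sum(1 for p in recent if p > hi)
    below = sum(1 for p in recent if p < lo)
    return above, below

# A steady error-rate window, then a disturbance (made-up numbers):
window = [100, 102, 98, 101, 99, 103, 97, 100]   # mean 100, stdev 2
later = [99, 250, 40, 101]                        # one spike up, one drop
print(anomalies(window, later))                   # counts outside [94, 106]
```

An alert then fires when `above + below` exceeds some tolerance over the evaluation window.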
[01:27:59] PROBLEM - salt-minion processes on analytics1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:28:19] PROBLEM - DPKG on analytics1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:28:28] PROBLEM - Disk space on analytics1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:28:28] PROBLEM - Hadoop NodeManager on analytics1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:28:29] PROBLEM - configured eth on analytics1013 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[01:28:48] PROBLEM - SSH on analytics1013 is CRITICAL - Socket timeout after 10 seconds
[01:29:23] ^ is that someone working or random failure?
[01:31:49] PROBLEM - Varnishkafka Delivery Errors per minute on cp4006 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[01:32:37] PROBLEM - Varnishkafka Delivery Errors per minute on cp4005 is CRITICAL 11.11% of data above the critical threshold [20000.0]
[01:35:10] bblack: the pattern of analyticsNNNN alerts followed by cpNNNN alerts suggests the problem is on the receiving side, which is consistent with the variety of alerts we're getting for analytics1013
[01:35:17] RECOVERY - Varnishkafka Delivery Errors per minute on cp4006 is OK Less than 1.00% above the threshold [0.0]
[01:35:39] bblack: I think calling Otto would be warranted at this point
[01:37:24] well
[01:37:38] RECOVERY - Varnishkafka Delivery Errors per minute on cp4005 is OK Less than 1.00% above the threshold [0.0]
[01:38:22] I'm pretty sure the cp* vk drerr noise lately is because I'm installing OS updates on them, and ulsfo (cp40xx) in particular seems to be having minor link issues, and vk is very sensitive to losing a few packets in the mess of it all.
[01:38:34] so, probably unrelated to analytics host issues
[01:38:59] were there other analyticsNNNN deaths earlier today?
[01:39:37] [01:07:27] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 728.845692291
[01:39:52] ah yeah I see it now
[01:40:05] the 1013 looks almost like the whole host died, but then it still pings
[01:40:13] (but I can't ssh into it so far)
[01:40:15] analytics1016 died earlier too
[01:40:56] yeah, this doesn't look good: http://ganglia.wikimedia.org/latest/?r=day&cs=&ce=&c=Analytics+cluster+eqiad&h=&tab=m&vn=&hide-hf=false&m=cpu_report&sh=1&z=small&hc=4&host_regex=&max_graphs=0&s=by+name
[01:41:24] yeah I see, looks like mutante made a ticket for 1016, etc
[01:41:41] two hosts down, one 15 mins ago, the other 3 hours ago, cluster-wide spike in nprocs and bytes in/out
[01:41:55] https://phabricator.wikimedia.org/T97349#1240076
[01:42:07] mutante just logged it in icinga, otto filed the ticket heh
[01:43:53] (03PS1) 10Ori.livneh: varnish: set do_gzip to true for text [puppet] - 10https://gerrit.wikimedia.org/r/207013
[01:44:16] let me dig a little here. two random hw issues in one day is usually not a hw issue :P
[01:44:27] yeah. switch failing?
[01:45:14] !log rebooting analytics1016
[01:45:18] err, 13
[01:45:21] Logged the message, Master
[01:45:32] !log rebooting analytics1013 (not 1016)
[01:45:36] Logged the message, Master
[01:46:15] well bios redirection works fine, unlike the claims about 1016 earlier
[01:47:14] logmsgbot --amend ? :)
[01:48:08] RECOVERY - salt-minion processes on analytics1013 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[01:48:10] you can edit the SAL on the wiki
[01:48:18] RECOVERY - puppet last run on analytics1013 is OK Puppet is currently enabled, last run 38 minutes ago with 0 failures
[01:48:19] RECOVERY - RAID on analytics1013 is OK no disks configured for RAID
[01:48:28] RECOVERY - DPKG on analytics1013 is OK: All packages OK
[01:48:37] RECOVERY - Disk space on analytics1013 is OK: DISK OK
[01:48:37] RECOVERY - Hadoop NodeManager on analytics1013 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager
[01:48:37] RECOVERY - configured eth on analytics1013 is OK - interfaces up
[01:48:48] RECOVERY - SSH on analytics1013 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0)
[01:49:17] RECOVERY - dhclient process on analytics1013 is OK: PROCS OK: 0 processes with command name dhclient
[01:49:17] RECOVERY - Hadoop DataNode on analytics1013 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode
[01:49:19] no crash in the logs :/
[01:49:28] PROBLEM - DPKG on cp4003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[01:49:31] but console was unresponsive before reboot from the mgmt port
[01:49:58] PROBLEM - DPKG on cp4004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[01:50:00] no anything in the logs really, just stopped
[01:50:10] cp400[34] dpkg is me
[01:51:08] RECOVERY - DPKG on cp4003 is OK: All packages OK
[01:51:38] RECOVERY - DPKG on cp4004 is OK: All packages OK
[01:55:20] ori: you still think I should ping otto re: high cluster load / strange traffic patterns / 1013 dying in much the same manner as 1016 earlier (but luckily with a working console)?
[01:55:27] are we losing stats?
[01:56:10] from dmesg on analytics1013:
[01:56:12] [ 397.057413] CPU21: Core temperature above threshold, cpu clock throttled (total events = 52279)
[01:56:12] [ 397.057422] CPU21: Package temperature above threshold, cpu clock throttled (total events = 56703)
[01:56:12] [ 397.058365] CPU21: Core temperature/speed normal
[01:56:14] [ 397.058379] CPU21: Package temperature/speed normal
[01:56:16] [ 397.081513] CPU9: Core temperature/speed normal
[01:56:18] [ 450.787766] mce: [Hardware Error]: Machine check events logged
[01:56:54] is that normal? there are a bunch of CPU temperature alerts
[01:57:15] I've come to think of the temp alerts as normal, yes
[01:57:21] mce, no
[01:58:26] oh and in the temp spam, I didn't see the kernel debug stuff a bit earlier than the hang itself
[01:58:43] Apr 28 01:34:12 analytics1013 kernel: [13933773.964528] INFO: task impalad:13954 blocked for more than 120 seconds.
[01:58:48] (and several similar)
[02:04:35] mcelog is relatively-new to us, it could be that mcelog is just spamming along with the other temp/speed alerts
[02:04:58] (which should be investigated too, but haven't proved problematic elsewhere, and seem to be maybe a config issue...)
[02:05:37] one of analytics10{11,13,14,15} will blow up next, going by ganglia
[02:06:41] the cluster runs map-reduce jobs submitted by users who are not always (or often) developers, and i vaguely recall load being an issue in the past
[02:07:30] so if i had to guess, there is some stupidly expensive job running, and either the hardware got old enough or the job is expensive enough that the overheating thing is finally becoming an issue
[02:07:54] I guess we're missing our usual phab bot?
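[Editor's note: the triage rule applied above — thermal-throttle chatter is routine, MCE hardware errors and hung-task warnings are not — can be sketched as a small classifier over kernel log lines. The patterns below match only the message shapes quoted in this log and are illustrative, not an exhaustive dmesg taxonomy:]

```python
import re

# Benign thermal chatter vs. signals worth a ticket, per the discussion above.
THROTTLE = re.compile(r'(Core|Package) temperature( above threshold|/speed normal)')
MCE = re.compile(r'mce: \[Hardware Error\]')
HUNG = re.compile(r'INFO: task .+ blocked for more than \d+ seconds')

def classify(line):
    """Return 'mce', 'hung-task', 'thermal', or 'other' for one kernel log line."""
    if MCE.search(line):
        return 'mce'        # machine-check event: investigate
    if HUNG.search(line):
        return 'hung-task'  # task stuck >120s: often precedes a hang
    if THROTTLE.search(line):
        return 'thermal'    # routine throttle spam on this hardware
    return 'other'
```

Running the quoted lines through `classify` separates the two `'thermal'`/`'mce'` classes the channel was distinguishing by eye.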
[02:08:00] booo, it died again
[02:08:07] I keep telling them to take the redis dependency out
[02:08:17] or at least re-connect
[02:08:27] in any case: https://phabricator.wikimedia.org/T97380 <- analytics1013
[02:08:38] restarted
[02:09:04] it’s the same program, instead of just passing params to a subfunction it puts them on redis and the other function just runs as another process and reads that...
[02:09:21] (all of that is partially my fault because grrrit-wm used to do that too)
[02:10:36] 6operations, 10Analytics: analytics1013 crashed, investigate... - https://phabricator.wikimedia.org/T97380#1240666 (10BBlack) Note that aside from the hung task stuff above, there was no final kernel crash output or anything, and other "normal" logging continues through about 01:47. When icinga alerted on all...
[02:10:40] there it is
[02:11:00] 6operations, 10MediaWiki-DjVu, 10MediaWiki-General-or-Unknown, 6Multimedia, and 2 others: img_metadata queries for Djvu files regularly saturate s4 slaves - https://phabricator.wikimedia.org/T96360#1240667 (10GOIII) What is the story re: XMLs and //DjVuImage.php// ?
[02:11:37] wikibugs: welcome back!
[02:12:28] ok gotta run
[02:12:31] i'd call otto, yeah
[02:12:46] if it's not an issue then it shouldn't be an alert
[02:13:17] is there some way to plot cpu temperature trends?
[02:14:17] I hit him up with an overview of the above on hangouts. If I don't see a response there in a few I'll try text/voice.
[02:14:31] s/hangouts/gtalk/ or whatever
[02:14:38] ori: not that we have in our current stats
[02:14:46] bblack: he left
[02:14:49] (ori)
[02:15:18] PROBLEM - DPKG on cp4002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[02:15:28] ^ still me, I bet 4001 is next!
[02:16:25] (or not. it never is when you actually call it)
[02:16:38] PROBLEM - puppet last run on cp4002 is CRITICAL Puppet last ran 4 hours ago
[02:16:58] RECOVERY - DPKG on cp4002 is OK: All packages OK
[02:20:08] PROBLEM - puppet last run on cp1066 is CRITICAL Puppet has 23 failures
[02:20:17] PROBLEM - puppet last run on mw1154 is CRITICAL Puppet has 1 failures
[02:20:18] PROBLEM - puppet last run on cp1064 is CRITICAL Puppet has 18 failures
[02:20:18] PROBLEM - puppet last run on eeden is CRITICAL Puppet has 6 failures
[02:20:18] PROBLEM - puppetmaster backend https on palladium is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error
[02:20:18] PROBLEM - puppet last run on mw1095 is CRITICAL puppet fail
[02:20:28] PROBLEM - puppet last run on mw1078 is CRITICAL puppet fail
[02:20:28] PROBLEM - puppet last run on virt1005 is CRITICAL Puppet has 10 failures
[02:20:28] PROBLEM - puppet last run on es1004 is CRITICAL Puppet has 15 failures
[02:20:29] PROBLEM - puppet last run on mw1075 is CRITICAL puppet fail
[02:20:29] PROBLEM - puppet last run on mw2101 is CRITICAL puppet fail
[02:20:29] PROBLEM - puppet last run on mw2111 is CRITICAL puppet fail
[02:20:37] PROBLEM - puppet last run on db1005 is CRITICAL Puppet has 23 failures
[02:20:38] PROBLEM - puppet last run on db2056 is CRITICAL Puppet has 9 failures
[02:20:47] PROBLEM - puppet last run on lvs3002 is CRITICAL puppet fail
[02:20:48] PROBLEM - puppet last run on mw2137 is CRITICAL Puppet has 19 failures
[02:20:48] PROBLEM - puppet last run on eventlog2001 is CRITICAL Puppet has 12 failures
[02:20:48] PROBLEM - puppet last run on cp3044 is CRITICAL puppet fail
[02:20:48] PROBLEM - puppet last run on cp3036 is CRITICAL Puppet has 29 failures
[02:20:48] PROBLEM - puppet last run on db1068 is CRITICAL Puppet has 10 failures
[02:20:57] PROBLEM - puppet last run on pollux is CRITICAL puppet fail
[02:20:57] PROBLEM - puppet last run on ms-be3004 is CRITICAL puppet fail
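[Editor's note: the wikibugs complaint earlier ("take the redis dependency out ... or at least re-connect") describes a generic fix: wrap the flaky connection in retry-with-backoff instead of dying on the first error. A hedged sketch under assumed names — `connect` and `work` stand in for the bot's real redis client and handler, which are not shown in this log:]

```python
import time

def with_reconnect(connect, work, retries=5, base_delay=0.01):
    """Run work(conn), re-establishing the connection on ConnectionError.

    Illustrative only: a real bot would loop forever on its event stream;
    this shows just the reconnect-with-exponential-backoff shape.
    """
    delay = base_delay
    for attempt in range(retries):
        try:
            conn = connect()            # e.g. open the redis connection
            return work(conn)           # e.g. consume the queued params
        except ConnectionError:
            if attempt == retries - 1:
                raise                   # give up only after all retries
            time.sleep(delay)
            delay *= 2                  # exponential backoff

# Demo with a connection that fails twice, then succeeds:
calls = {'n': 0}
def flaky_connect():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('redis down')
    return 'conn'
```

With this wrapper, a transient redis outage costs a short delay rather than a dead bot that someone has to notice and restart.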
[02:20:58] PROBLEM - puppet last run on db1058 is CRITICAL puppet fail
[02:20:58] PROBLEM - puppet last run on virt1009 is CRITICAL puppet fail
[02:20:58] PROBLEM - puppet last run on mc2015 is CRITICAL Puppet has 7 failures
[02:20:58] PROBLEM - puppet last run on labvirt1001 is CRITICAL Puppet has 23 failures
[02:20:58] PROBLEM - puppet last run on mw1019 is CRITICAL puppet fail
[02:20:59] PROBLEM - puppet last run on labsdb1002 is CRITICAL puppet fail
[02:20:59] PROBLEM - puppet last run on mw2129 is CRITICAL Puppet has 18 failures
[02:21:00] PROBLEM - puppet last run on mw2203 is CRITICAL puppet fail
[02:21:00] PROBLEM - puppet last run on ms-be1014 is CRITICAL puppet fail
[02:21:07] PROBLEM - puppet last run on mw1157 is CRITICAL Puppet has 35 failures
[02:21:07] PROBLEM - puppet last run on mc1015 is CRITICAL Puppet has 6 failures
[02:21:08] PROBLEM - puppet last run on mw1015 is CRITICAL Puppet has 28 failures
[02:21:08] PROBLEM - puppet last run on mw1102 is CRITICAL Puppet has 35 failures
[02:21:08] PROBLEM - puppet last run on mw1194 is CRITICAL Puppet has 39 failures
[02:21:08] PROBLEM - puppet last run on elastic1013 is CRITICAL Puppet has 24 failures
[02:21:08] PROBLEM - puppet last run on elastic1009 is CRITICAL Puppet has 18 failures
[02:21:17] PROBLEM - puppet last run on restbase1003 is CRITICAL Puppet has 13 failures
[02:21:17] PROBLEM - puppet last run on hydrogen is CRITICAL Puppet has 23 failures
[02:21:18] PROBLEM - puppet last run on db1009 is CRITICAL puppet fail
[02:21:18] PROBLEM - puppet last run on elastic1010 is CRITICAL Puppet has 26 failures
[02:21:18] PROBLEM - puppet last run on lvs1006 is CRITICAL puppet fail
[02:21:18] PROBLEM - puppet last run on db1019 is CRITICAL puppet fail
[02:21:18] PROBLEM - puppet last run on ms-be1010 is CRITICAL Puppet has 1 failures
[02:21:19] PROBLEM - puppet last run on cp3007 is CRITICAL Puppet has 22 failures
[02:21:27] PROBLEM - puppet last run on analytics1012 is CRITICAL Puppet has 11 failures
[02:21:27] PROBLEM - puppet last run on db2047 is CRITICAL Puppet has 27 failures
[02:21:27] PROBLEM - puppet last run on wtp2010 is CRITICAL Puppet has 8 failures
[02:21:28] PROBLEM - puppet last run on analytics1029 is CRITICAL puppet fail
[02:21:28] PROBLEM - puppet last run on neon is CRITICAL puppet fail
[02:21:28] PROBLEM - puppet last run on db1049 is CRITICAL puppet fail
[02:21:28] PROBLEM - puppet last run on mw1113 is CRITICAL Puppet has 25 failures
[02:21:29] PROBLEM - puppet last run on mw2142 is CRITICAL Puppet has 29 failures
[02:21:29] PROBLEM - puppet last run on wtp2004 is CRITICAL Puppet has 23 failures
[02:21:30] PROBLEM - puppet last run on mw2195 is CRITICAL Puppet has 49 failures
[02:21:30] PROBLEM - puppet last run on mw2182 is CRITICAL Puppet has 37 failures
[02:21:31] PROBLEM - puppet last run on mw2078 is CRITICAL Puppet has 18 failures
[02:21:31] PROBLEM - puppet last run on mw2098 is CRITICAL puppet fail
[02:21:37] PROBLEM - puppet last run on mw2060 is CRITICAL Puppet has 30 failures
[02:21:37] PROBLEM - puppet last run on mw2024 is CRITICAL Puppet has 42 failures
[02:21:37] PROBLEM - puppet last run on bast2001 is CRITICAL Puppet has 17 failures
[02:21:37] PROBLEM - puppet last run on palladium is CRITICAL Puppet has 32 failures
[02:21:37] PROBLEM - puppet last run on mw1047 is CRITICAL Puppet has 37 failures
[02:21:38] PROBLEM - puppet last run on mw1137 is CRITICAL Puppet has 69 failures
[02:21:38] PROBLEM - puppet last run on ocg1002 is CRITICAL Puppet has 27 failures
[02:21:39] PROBLEM - puppet last run on ms-be1005 is CRITICAL puppet fail
[02:21:39] PROBLEM - puppet last run on cp4013 is CRITICAL puppet fail
[02:21:39] wtf?
[02:21:57] PROBLEM - puppet last run on mw1182 is CRITICAL puppet fail
[02:21:57] PROBLEM - puppet last run on mw1230 is CRITICAL Puppet has 36 failures
[02:21:57] PROBLEM - puppet last run on mw1058 is CRITICAL Puppet has 72 failures
[02:21:58] PROBLEM - puppet last run on es1005 is CRITICAL Puppet has 13 failures
[02:21:58] PROBLEM - puppet last run on tmh1002 is CRITICAL Puppet has 54 failures
[02:21:58] PROBLEM - puppet last run on es1003 is CRITICAL Puppet has 23 failures
[02:21:58] PROBLEM - puppet last run on cp4016 is CRITICAL Puppet has 21 failures
[02:21:59] PROBLEM - puppet last run on cp4010 is CRITICAL Puppet has 16 failures
[02:21:59] PROBLEM - puppet last run on cp4011 is CRITICAL Puppet has 31 failures
[02:22:00] PROBLEM - puppet last run on cp3048 is CRITICAL Puppet has 46 failures
[02:22:00] PROBLEM - puppet last run on cp3020 is CRITICAL Puppet has 20 failures
[02:22:01] PROBLEM - puppet last run on mw1101 is CRITICAL puppet fail
[02:22:01] PROBLEM - puppet last run on mw1035 is CRITICAL puppet fail
[02:22:02] PROBLEM - puppet last run on mw1110 is CRITICAL Puppet has 58 failures
[02:22:03] puppetmaster death?
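[Editor's note: when hundreds of unrelated hosts fail puppet in the same minute, the question asked here ("puppetmaster death?") is usually answerable from the agents' error text: server-side markers mean the master, not the fleet, is broken. A hedged triage sketch — the marker strings are illustrative examples of server-side agent errors, not quotes from these hosts:]

```python
# Typical server-side markers in puppet agent output (illustrative list):
MASTER_SIDE_MARKERS = (
    '500 Internal Server Error',
    'Could not retrieve catalog from remote server',
    'Connection refused',
)

def master_side(agent_error):
    """True if the agent's failure message points at the master, not the host."""
    return any(m in agent_error for m in MASTER_SIDE_MARKERS)
```

If `master_side` is true across the board, the fix is one restart on the master rather than per-host debugging, which is exactly how this incident resolves later in the log.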
[02:22:07] PROBLEM - puppet last run on tmh1001 is CRITICAL Puppet has 58 failures [02:22:08] PROBLEM - puppet last run on labsdb1007 is CRITICAL Puppet has 25 failures [02:22:08] PROBLEM - puppet last run on dbproxy1003 is CRITICAL Puppet has 18 failures [02:22:08] PROBLEM - puppet last run on cp1057 is CRITICAL puppet fail [02:22:09] PROBLEM - puppet last run on mw2172 is CRITICAL puppet fail [02:22:09] PROBLEM - puppet last run on mw2119 is CRITICAL Puppet has 30 failures [02:22:09] PROBLEM - puppet last run on mw2120 is CRITICAL Puppet has 41 failures [02:22:09] PROBLEM - puppet last run on lanthanum is CRITICAL Puppet has 31 failures [02:22:27] PROBLEM - puppet last run on erbium is CRITICAL puppet fail [02:22:27] PROBLEM - puppet last run on mw1232 is CRITICAL puppet fail [02:22:28] PROBLEM - puppet last run on mw2131 is CRITICAL Puppet has 40 failures [02:22:28] PROBLEM - puppet last run on mw2061 is CRITICAL Puppet has 76 failures [02:22:28] PROBLEM - puppet last run on cp4017 is CRITICAL Puppet has 19 failures [02:22:28] PROBLEM - puppet last run on cp4015 is CRITICAL Puppet has 31 failures [02:22:28] PROBLEM - puppet last run on db2049 is CRITICAL puppet fail [02:22:29] PROBLEM - puppet last run on cp3032 is CRITICAL Puppet has 24 failures [02:22:29] PROBLEM - puppet last run on mw1128 is CRITICAL Puppet has 38 failures [02:22:30] PROBLEM - puppet last run on zirconium is CRITICAL puppet fail [02:22:30] PROBLEM - puppet last run on mw1179 is CRITICAL Puppet has 32 failures [02:22:31] PROBLEM - puppet last run on mw1083 is CRITICAL puppet fail [02:22:31] PROBLEM - puppet last run on mw2198 is CRITICAL Puppet has 33 failures [02:22:37] PROBLEM - puppet last run on mw1070 is CRITICAL Puppet has 77 failures [02:22:37] PROBLEM - puppet last run on db2041 is CRITICAL puppet fail [02:22:37] PROBLEM - puppet last run on mw2084 is CRITICAL Puppet has 67 failures [02:22:37] PROBLEM - puppet last run on lvs2005 is CRITICAL Puppet has 13 failures [02:22:38] PROBLEM - 
puppet last run on mw1073 is CRITICAL Puppet has 64 failures [02:22:38] PROBLEM - puppet last run on mw2110 is CRITICAL Puppet has 33 failures [02:22:38] PROBLEM - puppet last run on cp1051 is CRITICAL puppet fail [02:22:39] PROBLEM - puppet last run on mw1017 is CRITICAL Puppet has 33 failures [02:22:39] PROBLEM - puppet last run on mw2138 is CRITICAL Puppet has 77 failures [02:22:57] PROBLEM - puppet last run on mw1127 is CRITICAL Puppet has 76 failures [02:22:57] PROBLEM - puppet last run on mw1221 is CRITICAL puppet fail [02:22:57] PROBLEM - puppet last run on mc1010 is CRITICAL puppet fail [02:22:58] PROBLEM - puppet last run on elastic1016 is CRITICAL Puppet has 29 failures [02:22:58] PROBLEM - puppet last run on mc1008 is CRITICAL Puppet has 10 failures [02:22:58] PROBLEM - puppet last run on mw1234 is CRITICAL Puppet has 32 failures [02:22:58] PROBLEM - puppet last run on mw1184 is CRITICAL Puppet has 29 failures [02:22:59] PROBLEM - puppet last run on cp1074 is CRITICAL puppet fail [02:22:59] PROBLEM - puppet last run on stat1001 is CRITICAL Puppet has 16 failures [02:23:00] PROBLEM - puppet last run on mw2211 is CRITICAL Puppet has 74 failures [02:23:00] PROBLEM - puppet last run on restbase1006 is CRITICAL Puppet has 13 failures [02:23:07] PROBLEM - puppet last run on analytics1024 is CRITICAL Puppet has 14 failures [02:23:08] PROBLEM - puppet last run on mw1020 is CRITICAL Puppet has 71 failures [02:23:08] PROBLEM - puppet last run on mw1252 is CRITICAL puppet fail [02:23:09] PROBLEM - puppet last run on wtp2014 is CRITICAL puppet fail [02:23:17] PROBLEM - puppet last run on mw1196 is CRITICAL Puppet has 32 failures [02:23:17] PROBLEM - puppet last run on ms-be2014 is CRITICAL puppet fail [02:23:17] PROBLEM - puppet last run on mw2005 is CRITICAL puppet fail [02:23:17] PROBLEM - puppet last run on mc2008 is CRITICAL puppet fail [02:23:17] PROBLEM - puppet last run on mw2053 is CRITICAL puppet fail [02:23:18] PROBLEM - puppet last run on wtp1024 is 
CRITICAL puppet fail [02:23:18] PROBLEM - puppet last run on mw1245 is CRITICAL puppet fail [02:23:19] PROBLEM - puppet last run on mw1013 is CRITICAL puppet fail [02:23:20] it's all "500 internal server error on palladium", nothing's wrong with the clients... [02:23:27] PROBLEM - puppet last run on es1009 is CRITICAL Puppet has 14 failures [02:23:28] PROBLEM - puppet last run on mw1036 is CRITICAL puppet fail [02:23:28] PROBLEM - puppet last run on mw1062 is CRITICAL puppet fail [02:23:28] PROBLEM - puppet last run on db1007 is CRITICAL puppet fail [02:23:28] PROBLEM - puppet last run on cp1065 is CRITICAL Puppet has 18 failures [02:23:28] PROBLEM - RAID on snapshot1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [02:23:29] PROBLEM - puppet last run on mw1136 is CRITICAL Puppet has 39 failures [02:23:37] PROBLEM - puppet last run on mw2041 is CRITICAL puppet fail [02:23:38] PROBLEM - puppet last run on lvs4001 is CRITICAL Puppet has 14 failures [02:23:38] PROBLEM - puppet last run on mw1216 is CRITICAL puppet fail [02:23:38] PROBLEM - puppet last run on ms-be1016 is CRITICAL Puppet has 17 failures [02:23:47] PROBLEM - puppet last run on elastic1028 is CRITICAL Puppet has 15 failures [02:23:47] PROBLEM - puppet last run on analytics1039 is CRITICAL Puppet has 16 failures [02:23:48] PROBLEM - puppet last run on cp1069 is CRITICAL Puppet has 15 failures [02:23:48] PROBLEM - puppet last run on analytics1019 is CRITICAL puppet fail [02:23:48] PROBLEM - puppet last run on mw2177 is CRITICAL Puppet has 30 failures [02:23:49] PROBLEM - puppet last run on mw2165 is CRITICAL Puppet has 36 failures [02:23:49] PROBLEM - puppet last run on mw2102 is CRITICAL puppet fail [02:23:58] PROBLEM - puppet last run on mw1246 is CRITICAL Puppet has 28 failures [02:24:07] PROBLEM - puppet last run on mw1257 is CRITICAL puppet fail [02:24:07] PROBLEM - puppet last run on mw1244 is CRITICAL puppet fail [02:24:07] PROBLEM - puppet last run on mw1218 is CRITICAL Puppet has 80 
failures [02:24:07] PROBLEM - puppet last run on es1006 is CRITICAL puppet fail [02:24:08] PROBLEM - puppet last run on mw1233 is CRITICAL puppet fail [02:24:08] PROBLEM - puppet last run on mw2037 is CRITICAL Puppet has 75 failures [02:24:08] PROBLEM - puppet last run on mw2035 is CRITICAL Puppet has 83 failures [02:24:09] PROBLEM - puppet last run on db2061 is CRITICAL Puppet has 28 failures [02:24:09] PROBLEM - puppet last run on analytics1036 is CRITICAL puppet fail [02:24:10] PROBLEM - puppet last run on ocg1001 is CRITICAL Puppet has 32 failures [02:24:10] PROBLEM - puppet last run on db1041 is CRITICAL Puppet has 19 failures [02:24:17] PROBLEM - puppet last run on ytterbium is CRITICAL Puppet has 35 failures [02:24:17] PROBLEM - puppet last run on db2035 is CRITICAL puppet fail [02:24:18] PROBLEM - puppet last run on mw2112 is CRITICAL puppet fail [02:24:18] PROBLEM - puppet last run on mw1138 is CRITICAL Puppet has 69 failures [02:24:18] PROBLEM - puppet last run on mw2214 is CRITICAL puppet fail [02:24:18] PROBLEM - puppet last run on db2051 is CRITICAL puppet fail [02:24:18] PROBLEM - puppet last run on mw2160 is CRITICAL puppet fail [02:24:19] PROBLEM - puppet last run on db2048 is CRITICAL Puppet has 18 failures [02:24:19] PROBLEM - puppet last run on mw2133 is CRITICAL Puppet has 43 failures [02:24:20] PROBLEM - puppet last run on mw2181 is CRITICAL Puppet has 35 failures [02:24:20] PROBLEM - puppet last run on mw2025 is CRITICAL Puppet has 73 failures [02:24:24] !log l10nupdate Synchronized php-1.26wmf2/cache/l10n: (no message) (duration: 10m 17s) [02:24:27] PROBLEM - puppet last run on mc1004 is CRITICAL puppet fail [02:24:27] PROBLEM - puppet last run on dbstore1001 is CRITICAL puppet fail [02:24:27] PROBLEM - puppet last run on ganeti2005 is CRITICAL puppet fail [02:24:27] PROBLEM - puppet last run on ms-be2002 is CRITICAL puppet fail [02:24:27] PROBLEM - puppet last run on cp3013 is CRITICAL puppet fail [02:24:28] PROBLEM - puppet last run on 
pc1001 is CRITICAL Puppet has 33 failures [02:24:28] PROBLEM - puppet last run on analytics1034 is CRITICAL Puppet has 26 failures [02:24:29] PROBLEM - puppet last run on mw1040 is CRITICAL Puppet has 37 failures [02:24:29] PROBLEM - puppet last run on mw1096 is CRITICAL Puppet has 30 failures [02:24:32] [ pid=19116 file=ext/apache2/Hooks.cpp:727 time=2015-04-28 02:24:26.490 ]: [02:24:35] Unexpected error in mod_passenger: Could not connect to the ApplicationPool server: Broken pipe (32) [02:24:37] PROBLEM - puppet last run on rdb1004 is CRITICAL Puppet has 21 failures [02:24:37] PROBLEM - puppet last run on labsdb1001 is CRITICAL Puppet has 9 failures [02:24:38] PROBLEM - puppet last run on cp4007 is CRITICAL Puppet has 19 failures [02:24:38] PROBLEM - puppet last run on mw1134 is CRITICAL puppet fail [02:24:38] Backtrace: [02:24:38] PROBLEM - puppet last run on db1010 is CRITICAL puppet fail [02:24:38] PROBLEM - puppet last run on mw1132 is CRITICAL puppet fail [02:24:38] PROBLEM - puppet last run on elastic1026 is CRITICAL Puppet has 17 failures [02:24:39] Logged the message, Master [02:24:40] in 'Passenger::ApplicationPoolPtr Passenger::ApplicationPoolServer::connect()' (ApplicationPoolServer.h:746) [02:24:43] in 'int Hooks::handleRequest(request_rec*)' (Hooks.cpp:523) [02:24:47] PROBLEM - puppet last run on logstash1003 is CRITICAL Puppet has 30 failures [02:24:48] PROBLEM - puppet last run on cp1072 is CRITICAL puppet fail [02:24:48] PROBLEM - puppet last run on ms-fe1003 is CRITICAL puppet fail [02:24:48] PROBLEM - puppet last run on mw1124 is CRITICAL Puppet has 34 failures [02:24:49] PROBLEM - puppet last run on virt1012 is CRITICAL Puppet has 13 failures [02:24:49] PROBLEM - puppet last run on mw2197 is CRITICAL Puppet has 70 failures [02:24:49] PROBLEM - puppet last run on mw2106 is CRITICAL Puppet has 39 failures [02:24:49] PROBLEM - puppet last run on mw2179 is CRITICAL puppet fail [02:24:57] PROBLEM - puppet last run on ms-be1013 is CRITICAL puppet 
fail [02:24:57] PROBLEM - puppet last run on es2003 is CRITICAL Puppet has 9 failures [02:24:57] PROBLEM - puppet last run on mw2074 is CRITICAL puppet fail [02:24:57] PROBLEM - puppet last run on cp1059 is CRITICAL Puppet has 31 failures [02:24:57] PROBLEM - puppet last run on virt1002 is CRITICAL Puppet has 23 failures [02:24:58] PROBLEM - puppet last run on ms-be1017 is CRITICAL Puppet has 30 failures [02:24:58] PROBLEM - puppet last run on mw1140 is CRITICAL puppet fail [02:24:59] PROBLEM - puppet last run on mc1009 is CRITICAL puppet fail [02:24:59] PROBLEM - puppet last run on uranium is CRITICAL puppet fail [02:25:00] PROBLEM - puppet last run on mw1038 is CRITICAL puppet fail [02:25:00] PROBLEM - puppet last run on mw1256 is CRITICAL Puppet has 85 failures [02:25:01] PROBLEM - puppet last run on mw1178 is CRITICAL puppet fail [02:25:07] PROBLEM - puppet last run on ms-be1001 is CRITICAL Puppet has 21 failures [02:25:07] PROBLEM - puppet last run on wtp1021 is CRITICAL Puppet has 11 failures [02:25:08] PROBLEM - puppet last run on californium is CRITICAL puppet fail [02:25:08] PROBLEM - puppet last run on mw1072 is CRITICAL puppet fail [02:25:08] PROBLEM - puppet last run on mw2072 is CRITICAL Puppet has 39 failures [02:25:17] PROBLEM - puppet last run on mc1011 is CRITICAL Puppet has 10 failures [02:25:17] PROBLEM - puppet last run on mw1161 is CRITICAL Puppet has 81 failures [02:25:17] PROBLEM - puppet last run on wtp1009 is CRITICAL puppet fail [02:25:17] PROBLEM - puppet last run on mw2183 is CRITICAL puppet fail [02:25:17] PROBLEM - puppet last run on magnesium is CRITICAL puppet fail [02:25:18] PROBLEM - puppet last run on mw2099 is CRITICAL Puppet has 35 failures [02:25:18] PROBLEM - puppet last run on ms-be2013 is CRITICAL puppet fail [02:25:19] PROBLEM - puppet last run on elastic1031 is CRITICAL puppet fail [02:25:19] PROBLEM - puppet last run on mw1147 is CRITICAL Puppet has 81 failures [02:25:20] PROBLEM - puppet last run on analytics1015 is 
CRITICAL puppet fail [02:25:20] RECOVERY - puppetmaster backend https on palladium is OK: HTTP OK: Status line output matched 400 - 335 bytes in 0.026 second response time [02:25:21] PROBLEM - puppet last run on mw1080 is CRITICAL puppet fail [02:25:27] !log restarted apache2 on palladium - it was throwing infinite 500 errors due to some mod_passenger issue... [02:25:27] PROBLEM - puppet last run on mw1005 is CRITICAL puppet fail [02:25:27] PROBLEM - puppet last run on mw1109 is CRITICAL Puppet has 32 failures [02:25:27] PROBLEM - puppet last run on eventlog1001 is CRITICAL puppet fail [02:25:28] PROBLEM - puppet last run on mw1006 is CRITICAL puppet fail [02:25:28] PROBLEM - puppet last run on cp1043 is CRITICAL Puppet has 13 failures [02:25:28] PROBLEM - puppet last run on mw1031 is CRITICAL Puppet has 34 failures [02:25:28] PROBLEM - puppet last run on labnodepool1001 is CRITICAL Puppet has 11 failures [02:25:29] PROBLEM - puppet last run on mw2170 is CRITICAL Puppet has 81 failures [02:25:29] PROBLEM - puppet last run on mw2046 is CRITICAL Puppet has 34 failures [02:25:30] PROBLEM - puppet last run on mw2032 is CRITICAL puppet fail [02:25:32] Logged the message, Master [02:25:47] PROBLEM - puppet last run on mw2100 is CRITICAL puppet fail [02:25:48] PROBLEM - puppet last run on db1065 is CRITICAL puppet fail [02:25:57] PROBLEM - puppet last run on mw1145 is CRITICAL puppet fail [02:25:58] PROBLEM - puppet last run on analytics1025 is CRITICAL puppet fail [02:25:58] PROBLEM - puppet last run on db2070 is CRITICAL Puppet has 12 failures [02:25:58] PROBLEM - puppet last run on neptunium is CRITICAL Puppet has 33 failures [02:25:58] PROBLEM - puppet last run on mw1048 is CRITICAL Puppet has 81 failures [02:25:58] PROBLEM - puppet last run on db2009 is CRITICAL puppet fail [02:25:59] PROBLEM - puppet last run on wtp1017 is CRITICAL Puppet has 15 failures [02:25:59] PROBLEM - puppet last run on mw1026 is CRITICAL puppet fail [02:25:59] PROBLEM - puppet last run on 
db2039 is CRITICAL puppet fail [02:26:00] PROBLEM - puppet last run on mw2169 is CRITICAL puppet fail [02:26:00] PROBLEM - puppet last run on mw2202 is CRITICAL Puppet has 26 failures [02:26:01] PROBLEM - puppet last run on mw2103 is CRITICAL Puppet has 63 failures [02:26:07] PROBLEM - puppet last run on mw2121 is CRITICAL puppet fail [02:26:07] PROBLEM - puppet last run on cp4006 is CRITICAL puppet fail [02:26:08] PROBLEM - puppet last run on cp3015 is CRITICAL Puppet has 25 failures [02:26:08] PROBLEM - puppet last run on mw1067 is CRITICAL Puppet has 75 failures [02:26:08] PROBLEM - puppet last run on db1029 is CRITICAL Puppet has 24 failures [02:26:08] PROBLEM - puppet last run on mw1240 is CRITICAL Puppet has 77 failures [02:26:08] PROBLEM - puppet last run on mc1016 is CRITICAL Puppet has 24 failures [02:26:09] PROBLEM - puppet last run on analytics1017 is CRITICAL Puppet has 20 failures [02:26:09] PROBLEM - puppet last run on mw1115 is CRITICAL Puppet has 36 failures [02:26:10] PROBLEM - puppet last run on caesium is CRITICAL Puppet has 25 failures [02:26:17] PROBLEM - puppet last run on mw1007 is CRITICAL puppet fail [02:26:17] PROBLEM - puppet last run on ms-be1006 is CRITICAL puppet fail [02:26:17] PROBLEM - puppet last run on snapshot1003 is CRITICAL puppet fail [02:26:18] PROBLEM - puppet last run on cp1049 is CRITICAL puppet fail [02:26:18] PROBLEM - puppet last run on mw1012 is CRITICAL puppet fail [02:26:27] PROBLEM - puppet last run on db1053 is CRITICAL Puppet has 16 failures [02:26:27] PROBLEM - puppet last run on mw1200 is CRITICAL Puppet has 34 failures [02:26:28] PROBLEM - puppet last run on db2046 is CRITICAL Puppet has 19 failures [02:26:28] PROBLEM - puppet last run on cp1070 is CRITICAL Puppet has 11 failures [02:26:28] PROBLEM - puppet last run on analytics1033 is CRITICAL puppet fail [02:26:28] PROBLEM - puppet last run on dbproxy1002 is CRITICAL Puppet has 11 failures [02:26:28] PROBLEM - puppet last run on analytics1041 is CRITICAL 
Puppet has 12 failures [02:26:29] PROBLEM - puppet last run on mw2122 is CRITICAL Puppet has 68 failures [02:26:37] PROBLEM - puppet last run on es2005 is CRITICAL puppet fail [02:26:37] PROBLEM - puppet last run on mw2015 is CRITICAL puppet fail [02:26:38] PROBLEM - puppet last run on mw1028 is CRITICAL Puppet has 74 failures [02:26:38] PROBLEM - puppet last run on ms-be1004 is CRITICAL puppet fail [02:26:38] PROBLEM - puppet last run on logstash1001 is CRITICAL puppet fail [02:26:38] PROBLEM - puppet last run on mw1063 is CRITICAL puppet fail [02:26:47] PROBLEM - puppet last run on potassium is CRITICAL puppet fail [02:26:48] PROBLEM - puppet last run on mw1089 is CRITICAL Puppet has 36 failures [02:26:48] PROBLEM - puppet last run on analytics1040 is CRITICAL Puppet has 19 failures [02:26:48] PROBLEM - puppet last run on labstore1003 is CRITICAL puppet fail [02:26:48] PROBLEM - puppet last run on cp1055 is CRITICAL puppet fail [02:26:48] PROBLEM - puppet last run on mw2034 is CRITICAL Puppet has 79 failures [02:26:48] RECOVERY - RAID on snapshot1004 is OK no RAID installed [02:26:49] PROBLEM - puppet last run on cp1047 is CRITICAL puppet fail [02:26:49] PROBLEM - puppet last run on mw1174 is CRITICAL puppet fail [02:26:50] PROBLEM - puppet last run on netmon1001 is CRITICAL Puppet has 32 failures [02:26:57] PROBLEM - puppet last run on wtp2005 is CRITICAL puppet fail [02:26:57] PROBLEM - puppet last run on mw2199 is CRITICAL puppet fail [02:26:57] PROBLEM - puppet last run on dbstore2001 is CRITICAL Puppet has 13 failures [02:26:58] PROBLEM - puppet last run on heze is CRITICAL puppet fail [02:26:58] PROBLEM - puppet last run on mw2057 is CRITICAL Puppet has 75 failures [02:26:58] PROBLEM - puppet last run on mw1106 is CRITICAL Puppet has 77 failures [02:26:58] PROBLEM - puppet last run on mw1197 is CRITICAL Puppet has 30 failures [02:26:59] PROBLEM - puppet last run on mw1150 is CRITICAL puppet fail [02:26:59] PROBLEM - puppet last run on cp3012 is CRITICAL 
Puppet has 26 failures [02:27:00] PROBLEM - puppet last run on cp3045 is CRITICAL Puppet has 28 failures [02:27:07] PROBLEM - puppet last run on elastic1018 is CRITICAL puppet fail [02:27:08] PROBLEM - puppet last run on elastic1004 is CRITICAL puppet fail [02:27:08] PROBLEM - puppet last run on ms-be1003 is CRITICAL puppet fail [02:27:08] PROBLEM - puppet last run on elastic1021 is CRITICAL puppet fail [02:27:08] PROBLEM - puppet last run on es1008 is CRITICAL puppet fail [02:27:08] PROBLEM - puppet last run on mw1082 is CRITICAL Puppet has 76 failures [02:27:08] PROBLEM - puppet last run on carbon is CRITICAL puppet fail [02:27:09] PROBLEM - puppet last run on mw2108 is CRITICAL Puppet has 30 failures [02:27:09] PROBLEM - puppet last run on mw2076 is CRITICAL puppet fail [02:27:10] PROBLEM - puppet last run on mw2161 is CRITICAL Puppet has 70 failures [02:27:10] PROBLEM - puppet last run on mw2105 is CRITICAL puppet fail [02:27:17] PROBLEM - puppet last run on es2008 is CRITICAL puppet fail [02:27:17] PROBLEM - puppet last run on db2002 is CRITICAL puppet fail [02:27:17] PROBLEM - puppet last run on mw2014 is CRITICAL puppet fail [02:27:17] PROBLEM - puppet last run on es2002 is CRITICAL puppet fail [02:27:17] PROBLEM - puppet last run on lvs4002 is CRITICAL Puppet has 13 failures [02:27:18] PROBLEM - puppet last run on mw1100 is CRITICAL puppet fail [02:27:18] PROBLEM - puppet last run on labstore2001 is CRITICAL Puppet has 21 failures [02:27:19] PROBLEM - puppet last run on ms-be2006 is CRITICAL puppet fail [02:27:19] PROBLEM - puppet last run on mw1141 is CRITICAL Puppet has 34 failures [02:27:20] PROBLEM - puppet last run on helium is CRITICAL puppet fail [02:27:27] PROBLEM - puppet last run on db1066 is CRITICAL Puppet has 9 failures [02:27:27] PROBLEM - puppet last run on mw1187 is CRITICAL puppet fail [02:27:27] PROBLEM - puppet last run on db2045 is CRITICAL puppet fail [02:27:27] PROBLEM - puppet last run on ms-fe2004 is CRITICAL Puppet has 26 failures 
[02:27:28] PROBLEM - puppet last run on mc2003 is CRITICAL puppet fail [02:27:28] PROBLEM - puppet last run on db2067 is CRITICAL Puppet has 13 failures [02:27:28] PROBLEM - puppet last run on mw2145 is CRITICAL puppet fail [02:27:29] PROBLEM - puppet last run on db1022 is CRITICAL puppet fail [02:27:29] PROBLEM - puppet last run on wtp1006 is CRITICAL Puppet has 27 failures [02:27:37] PROBLEM - puppet last run on wtp1020 is CRITICAL puppet fail [02:27:37] PROBLEM - puppet last run on wtp1016 is CRITICAL Puppet has 15 failures [02:27:37] PROBLEM - puppet last run on mw1173 is CRITICAL puppet fail [02:27:37] PROBLEM - puppet last run on mw1045 is CRITICAL Puppet has 70 failures [02:27:37] PROBLEM - puppet last run on mw1242 is CRITICAL Puppet has 32 failures [02:27:38] PROBLEM - puppet last run on ms-be2003 is CRITICAL Puppet has 18 failures [02:27:38] PROBLEM - puppet last run on mw2124 is CRITICAL Puppet has 31 failures [02:27:39] PROBLEM - puppet last run on elastic1007 is CRITICAL Puppet has 24 failures [02:27:39] RECOVERY - puppet last run on cp1051 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [02:27:40] PROBLEM - puppet last run on mw1189 is CRITICAL puppet fail [02:27:40] PROBLEM - puppet last run on analytics1035 is CRITICAL Puppet has 19 failures [02:27:41] PROBLEM - puppet last run on db2059 is CRITICAL Puppet has 17 failures [02:27:41] PROBLEM - puppet last run on mc2004 is CRITICAL Puppet has 14 failures [02:27:42] PROBLEM - puppet last run on db2019 is CRITICAL Puppet has 17 failures [02:27:58] PROBLEM - puppet last run on mw1060 is CRITICAL puppet fail [02:27:58] PROBLEM - puppet last run on mw1254 is CRITICAL Puppet has 32 failures [02:27:58] PROBLEM - puppet last run on lvs3001 is CRITICAL puppet fail [02:27:58] PROBLEM - puppet last run on mc1003 is CRITICAL Puppet has 10 failures [02:27:58] PROBLEM - puppet last run on ms-fe1001 is CRITICAL Puppet has 12 failures [02:28:07] PROBLEM - puppet last run on mw1003 is CRITICAL 
Puppet has 23 failures [02:28:07] PROBLEM - puppet last run on mw1061 is CRITICAL puppet fail [02:28:07] PROBLEM - puppet last run on cp3010 is CRITICAL Puppet has 28 failures [02:28:07] PROBLEM - puppet last run on db1073 is CRITICAL Puppet has 21 failures [02:28:08] PROBLEM - puppet last run on lvs1005 is CRITICAL puppet fail [02:28:08] PROBLEM - puppet last run on mw1099 is CRITICAL puppet fail [02:28:08] PROBLEM - puppet last run on mw1164 is CRITICAL puppet fail [02:28:09] PROBLEM - puppet last run on mc2010 is CRITICAL puppet fail [02:28:09] PROBLEM - puppet last run on es2001 is CRITICAL puppet fail [02:28:10] PROBLEM - puppet last run on mw1153 is CRITICAL puppet fail [02:28:17] PROBLEM - puppet last run on wtp2001 is CRITICAL Puppet has 20 failures [02:28:17] PROBLEM - puppet last run on mw2173 is CRITICAL Puppet has 28 failures [02:28:17] PROBLEM - puppet last run on mw2123 is CRITICAL puppet fail [02:28:17] PROBLEM - puppet last run on mw2033 is CRITICAL Puppet has 75 failures [02:28:17] PROBLEM - puppet last run on mw2104 is CRITICAL puppet fail [02:28:18] PROBLEM - puppet last run on mw2045 is CRITICAL puppet fail [02:28:18] PROBLEM - puppet last run on analytics1020 is CRITICAL Puppet has 15 failures [02:28:26] there we go, that's the right bot to kill [02:28:53] I tested one node after the palladium apache2 restart and it succeeded, so I think it's just a matter of waiting through them all retrying again [02:29:00] no point enduring the spam meanwhile [02:29:15] !log LocalisationUpdate completed (1.26wmf2) at 2015-04-28 02:28:11+00:00 [02:29:20] Logged the message, Master [02:29:31] (I'm tailing the log on the host itself with puppet messages filtered, in case we miss anything else important) [02:35:31] bblack: \o/ [02:35:34] that bot needs love [02:35:48] I moved it to ops/puppet repo hoping someone would give it more love than it did when it was a deb for no reason [02:36:20] I've suggested the same improvement like 5 times in the past, but never 
made the effort myself either [02:37:06] (which is: when msg rate goes over say 10 messages in a 30s window or something like that, start suppressing messages with just a single line "300 alerts suppressed" every minute or so, until the rate goes down again) [02:37:27] subject to bikeshedding various time/count tunables [02:38:42] we can always go look at the icinga web UI to see things, the point of alerts is just to wake people up, IMHO [02:39:12] re: analytics, otto and I are chatting about it in gtalk for a bit now, he's aware and looking at things [02:39:23] aye, ja we should chat in here [02:39:28] so, bblack, 1013 just came back up, right? [02:39:52] after you powercycled? [02:39:59] yeah [02:40:21] bblack: yeah. [02:40:25] bblack: agreed. [02:40:40] bblack: it’s something I’ve wanted to tackle for a while as well, but time. [02:41:52] there are so many Time-related songs, but when icinga + puppet are involved, I think Anthrax is probably the most appropriate one [02:41:57] https://www.youtube.com/watch?v=51vMbmIhyKk [02:43:04] 6operations, 10Analytics: analytics1013 crashed, investigate... - https://phabricator.wikimedia.org/T97380#1240688 (10Ottomata) Yeah this is very strange. This is the 4th node we have had this happen to in the last 2 weeks or so. (Well, 1016 happened today, and we are not sure that the same thing happened th... [02:43:36] heh :) [02:44:28] (because just like puppet+icinga, Anthrax feels like smashing your head against a brick wall, I guess) [02:45:24] 6operations, 6Labs, 10Tool-Labs, 7Monitoring: Add catchall tests for toollabs to catchpoint - https://phabricator.wikimedia.org/T97321#1240689 (10yuvipanda) Actually, after talking with @ori some more I think Catchpoint should play a much bigger role in measuring uptime of toollabs. I am thinking we'll exp... [02:47:42] ottomata: possibly also related?
https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=analytics1021&service=Kafka+Broker+Messages+In [02:47:52] 6operations, 6Labs, 10Tool-Labs, 7Monitoring: Add catchall tests for toollabs to catchpoint - https://phabricator.wikimedia.org/T97321#1240691 (10yuvipanda) (using catchpoint for this might be overkill - and also its smallest resolution seems to be 5mins when I'd like this to be 1min) [02:47:59] "kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 1.00816183352" for past 1h45m or so [02:48:06] 6operations, 6Labs, 10Tool-Labs, 7Monitoring: Add catchall tests for toollabs to catchpoint - https://phabricator.wikimedia.org/T97321#1240692 (10yuvipanda) p:5Triage>3Normal [02:48:08] !log l10nupdate Synchronized php-1.26wmf3/cache/l10n: (no message) (duration: 08m 32s) [02:48:16] Logged the message, Master [02:48:50] also, ori's prediction from before the puppetspam: [02:48:51] 02:05 < ori> one of analytics10{11,13,14,15} will blow up next, going by ganglia [02:50:21] ah the kafka thing is separate... [02:50:43] also a regular annoyance i am waiting for some new hardware one day to try to fix that [02:50:53] !log 'kafka preferred-replica-election' [02:50:57] Logged the message, Master [02:51:31] bblack did ori predict that before 1013 blew up? 
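[editor's note] The alert-suppression scheme bblack proposes around 02:37 above (pass messages through until the rate tops ~10 per 30s window, then emit only a periodic "N alerts suppressed" summary until the rate drops) could look roughly like the sketch below. The class name and all tunables are illustrative placeholders matching the discussion, not code from the actual bot:

```python
import time
from collections import deque

class AlertThrottle:
    # Pass messages through while the rate stays at or under `limit` per
    # `window` seconds; above that, swallow them and emit at most one
    # "<N> alerts suppressed" summary per `summary_every` seconds until
    # the rate drops again.  All numbers are the bikeshed tunables from
    # the discussion, nothing deployed.
    def __init__(self, limit=10, window=30.0, summary_every=60.0,
                 now=time.monotonic):
        self.limit, self.window, self.summary_every = limit, window, summary_every
        self.now = now
        self.times = deque()      # timestamps of recently seen messages
        self.suppressed = 0       # messages swallowed since last summary
        self.last_summary = None

    def feed(self, msg):
        """Return the line to relay to IRC, or None to stay silent."""
        t = self.now()
        self.times.append(t)
        while t - self.times[0] > self.window:
            self.times.popleft()
        if len(self.times) <= self.limit:
            if self.suppressed:   # rate dropped again: flush pending tally
                pending, self.suppressed = self.suppressed, 0
                return "%d alerts suppressed; %s" % (pending, msg)
            return msg
        self.suppressed += 1
        if self.last_summary is None or t - self.last_summary >= self.summary_every:
            self.last_summary = t
            pending, self.suppressed = self.suppressed, 0
            return "%d alerts suppressed" % pending
        return None
```

The bot would call `feed()` on every icinga message and relay only non-`None` results, so a puppetmaster-wide failure storm like the one above collapses into one summary line a minute.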
[01:51:59] after [02:52:38] !log LocalisationUpdate completed (1.26wmf3) at 2015-04-28 02:51:34+00:00 [02:52:42] Logged the message, Master [02:54:34] bblack-mw: RECOVERY - Kafka Broker Messages In on analytics1021 is OK: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate OKAY: 2660.47422248 [02:54:41] oops -wm [02:55:04] spam seems to have settled down anyways, bringing back the bot [02:56:18] PROBLEM - puppet last run on neon is CRITICAL puppet fail [02:57:04] ^ that's just leftover from having neon's puppet disabled to keep the bot down [02:58:38] PROBLEM - puppet last run on cp4005 is CRITICAL Puppet last ran 4 hours ago [02:59:05] ^ and that's the last remaining leftover alert-trash from my cp* package upgrade process earlier [03:00:28] RECOVERY - puppet last run on cp4005 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [03:01:37] RECOVERY - puppet last run on neon is OK Puppet is currently enabled, last run 59 seconds ago with 0 failures [03:02:18] 6operations, 10Analytics: analytics1013 crashed, investigate... - https://phabricator.wikimedia.org/T97380#1240695 (10Ottomata) Note that so far, only the older of the Dells in the cluster have crashed. analytics1011-analytics1020 [03:02:45] ok bblack, i guess its fine for now. not good, but the cluster is doing everything it should when things crash [03:02:49] so that's good at least [03:02:52] now i'm going to crash [03:03:10] thanks for rebooting 1013, and thanks for the ping, glad I was nearby a computer to at least give it a second look [03:05:23] 6operations, 6Labs, 10Tool-Labs, 7Monitoring: Add catchall tests for toollabs to catchpoint - https://phabricator.wikimedia.org/T97321#1240699 (10yuvipanda) I could also just do this in prod icinga, fwiw. [03:44:01] mutante: (not super pressing right now, but...)
there are some curious discrepancies between the article numbers at http://wikistats.wmflabs.org/display.php?t=wp and https://meta.wikimedia.org/w/index.php?title=List_of_Wikipedias/Table&curid=99149&diff=12048692&oldid=12041961 ... [03:44:21] ...the numbers for the wikipedias at #2-#6 are exactly identical,but [03:46:01] ...#1 (enwiki) differs a tiny bit, and the russian wp has 25k articles more [03:47:39] and the total count differs too: per the wmflabs page, we are just below the 35 million milestone, per the meta page (which i thought is just a copy of that) we passed it already [03:50:02] (and not sure it's related to the issues described at https://meta.wikimedia.org/wiki/User:Dcljr/Article_count_changes ....) [04:14:03] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: Force all Wikimedia cluster traffic to be over SSL for all users (logged-in and anon) - https://phabricator.wikimedia.org/T49832#1240750 (10Chmarkine) Can we start to force HTTPS for all users from the US soon? They should have low latency impact, since th... [04:21:34] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: Force all Wikimedia cluster traffic to be over SSL for all users (logged-in and anon) - https://phabricator.wikimedia.org/T49832#1240760 (10BBlack) Turning this on doesn't really go by-user-location, it goes by-wiki or groups of wikis (mostly because of HS... [04:27:38] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL host 208.80.154.196, interfaces up: 228, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-codfw:xe-5/2/1 (Telia, IC-307235) (#2648) [10Gbps wave]BR [04:29:17] RECOVERY - Router interfaces on cr1-eqiad is OK host 208.80.154.196, interfaces up: 230, down: 0, dormant: 0, excluded: 0, unused: 0 [04:33:22] ? [04:34:22] <^demon|away> cr1-eqiad? 
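[editor's note] Going back to the Kafka "Broker Messages In" alert above: a healthy analytics broker ingests thousands of messages per second, so a FifteenMinuteRate near zero (the 1.008 reading) means the broker has lost partition leadership, and the rate recovers after the preferred-replica election (the 2660 reading). The check's core logic is just a floor comparison; a minimal sketch, where the function name and the 1000.0 threshold are made-up placeholders since the real threshold is not visible in this log:

```python
def kafka_messages_in_status(rate_15m, crit_floor=1000.0):
    # Sketch of the check's logic: alert when the broker's 15-minute
    # messages-in rate falls below a floor, meaning it is no longer
    # taking traffic.  crit_floor is a placeholder, not the real value.
    return "CRITICAL" if rate_15m < crit_floor else "OK"
```

In production the value would come from a Graphite query for `kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate` rather than a literal; the sketch only shows the comparison.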
[04:35:33] i think there was some scheduled maintenance earlier, dunno [04:35:43] it's fine now so w/e [04:44:27] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL host 208.80.154.197, interfaces up: 214, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr2-codfw:xe-5/2/1 (Telia, IC-307236) (#3658) [10Gbps wave]BR [04:52:47] RECOVERY - Router interfaces on cr2-eqiad is OK host 208.80.154.197, interfaces up: 216, down: 0, dormant: 0, excluded: 0, unused: 0 [04:53:00] bblack: you are aware, yes? [04:53:16] telia link is flapping [05:05:23] (03PS1) 10KartikMistry: Added initial Debian packaging for apertium-eus [debs/contenttranslation/apertium-eus] - 10https://gerrit.wikimedia.org/r/207027 (https://phabricator.wikimedia.org/T96653) [05:09:43] blerg, whatever it is, it's affecting my bounce to irc as well :P [05:09:57] I need to go double-check, but no, I don't think we had planned maint on Telia [05:13:17] the fact that nothing else blipped makes me not too worried, yet [05:14:01] oh! [05:14:03] heh [05:14:36] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: Force all Wikimedia cluster traffic to be over SSL for all users (logged-in and anon) - https://phabricator.wikimedia.org/T49832#1240811 (10MZMcBride) >>! In T49832#1240760, @BBlack wrote: > Personally, I'm all for turning this on as quickly as we can and... [05:14:47] I guess when I saw alerts + you asking about it too, in my head I assumed this was one of our important-er links and sort of mentally did a s/codfw/ulsfo/ [05:14:57] those telia links are to codfw, which is not alive for users [05:15:15] so, don't worry [05:20:20] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: Force all Wikimedia cluster traffic to be over SSL for all users (logged-in and anon) - https://phabricator.wikimedia.org/T49832#1240813 (10BBlack) >>! In T49832#1240811, @MZMcBride wrote: > My understanding is that we could do a slow ramp-up by setting th... 
[05:29:30] <_joe_> good morning [05:31:51] !log LocalisationUpdate ResourceLoader cache refresh completed at Tue Apr 28 05:30:47 UTC 2015 (duration 30m 46s) [05:31:59] Logged the message, Master [05:44:19] _joe_: 'giorno :) [05:44:41] _joe_: can you maybe do one of your legendary load tests and dissolve any FUD on this matter? :) https://phabricator.wikimedia.org/T49832#1240811 [05:45:07] <_joe_> me? I think brandon and or.i are working pretty hard on that [05:45:27] <_joe_> I mean from what I understand the infrastructure can easily handle the load [05:45:31] No doubt. [05:45:38] PROBLEM - Host mw2027 is DOWN: PING CRITICAL - Packet loss = 100% [05:45:39] Ah. Great. [05:46:00] <_joe_> but SSL for everything means higher latencies for users, and that's a given [05:46:14] <_joe_> so I think or.i is trying to quantify that [05:46:43] <_joe_> also, there are political considerations to keep in mind [05:46:58] <_joe_> like countries that block https [05:47:07] RECOVERY - Host mw2027 is UP: PING WARNING - Packet loss = 44%, RTA = 51.70 ms [05:47:22] <_joe_> and countries where people have such old computers that still run XP [05:47:34] For that we'd presumably follow the same exceptions as we did for login. [05:47:37] <_joe_> like I don't know, the Italian government :P [05:47:44] lol :p [05:48:07] <_joe_> Nemo_bis: I'm not joking, I'm pretty sure most of the italian public infrastructure still uses win XP [05:48:08] (03PS1) 10KartikMistry: Added initial Debian package for apertium-eu-en [debs/contenttranslation/apertium-eu-en] - 10https://gerrit.wikimedia.org/r/207031 (https://phabricator.wikimedia.org/T96653) [05:48:19] <_joe_> and win XP/IE will not be able to connect to us via HTTP [05:48:22] <_joe_> *HTTPS [05:48:28] The difference is that if those Italian state's employees can't access Wikimedia projects maybe we're happier. [05:48:46] Especially if that includes the head of the soprintendenza of Firenze.
;) [05:48:58] <_joe_> so, while I'm all for HTTPS-by-default, I do get it's a very complex matter [05:49:17] <_joe_> and not strictly technical at this point [05:49:24] * YuviPanda is just happy to do whatever prod does [05:51:42] (03PS3) 10KartikMistry: Added initial Debian package for apertium-en-gl [debs/contenttranslation/apertium-en-gl] - 10https://gerrit.wikimedia.org/r/206803 (https://phabricator.wikimedia.org/T96654) [05:55:31] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: Force all Wikimedia cluster traffic to be over SSL for all users (logged-in and anon) - https://phabricator.wikimedia.org/T49832#493804 (10Nemo_bis) [05:56:30] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: Force all Wikimedia cluster traffic to be over SSL for all users (logged-in and anon) - https://phabricator.wikimedia.org/T49832#493804 (10Nemo_bis) [05:56:32] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: HTTPS performance & UA adoption metrics - https://phabricator.wikimedia.org/T86664#1240857 (10Nemo_bis) [06:05:33] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: HTTPS performance & UA adoption metrics - https://phabricator.wikimedia.org/T86664#1240860 (10Nemo_bis) Thanks for creating this task. It would be even nice if the description mentioned *all* the research that is needed or ongoing before the big red switch... 
[06:30:18] PROBLEM - puppet last run on cp3016 is CRITICAL Puppet has 1 failures [06:30:28] PROBLEM - puppet last run on logstash1002 is CRITICAL puppet fail [06:30:48] PROBLEM - puppet last run on virt1006 is CRITICAL Puppet has 1 failures [06:30:59] PROBLEM - puppet last run on iron is CRITICAL Puppet has 1 failures [06:31:18] PROBLEM - puppet last run on elastic1027 is CRITICAL Puppet has 1 failures [06:31:28] PROBLEM - puppet last run on wtp2015 is CRITICAL Puppet has 1 failures [06:32:08] PROBLEM - puppet last run on db1050 is CRITICAL Puppet has 1 failures [06:32:27] PROBLEM - puppet last run on mw1153 is CRITICAL Puppet has 1 failures [06:32:27] PROBLEM - puppet last run on ms-fe2001 is CRITICAL Puppet has 1 failures [06:33:08] PROBLEM - puppet last run on mw2013 is CRITICAL Puppet has 1 failures [06:33:28] <_joe_> oh it's 8:30 [06:33:48] <_joe_> mod_passenger remembers me it's time for real work [06:34:48] PROBLEM - puppet last run on mw1100 is CRITICAL Puppet has 1 failures [06:35:07] PROBLEM - puppet last run on mw2134 is CRITICAL Puppet has 1 failures [06:35:07] PROBLEM - puppet last run on mw2126 is CRITICAL Puppet has 2 failures [06:35:57] PROBLEM - puppet last run on mw2066 is CRITICAL Puppet has 1 failures [06:36:27] PROBLEM - puppet last run on mw2050 is CRITICAL Puppet has 1 failures [06:42:47] (03PS10) 10KartikMistry: Added initial Debian packaging [debs/contenttranslation/apertium-fr-es] - 10https://gerrit.wikimedia.org/r/195577 (https://phabricator.wikimedia.org/T92252) [06:45:28] RECOVERY - puppet last run on cp3016 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:45:37] RECOVERY - puppet last run on db1050 is OK Puppet is currently enabled, last run 52 seconds ago with 0 failures [06:45:48] RECOVERY - puppet last run on mw1153 is OK Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:45:49] RECOVERY - puppet last run on virt1006 is OK Puppet is currently enabled, last run 30 seconds ago with 0 
failures [06:45:58] RECOVERY - puppet last run on mw2066 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures [06:46:07] RECOVERY - puppet last run on iron is OK Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:46:18] RECOVERY - puppet last run on elastic1027 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:29] RECOVERY - puppet last run on mw1100 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:29] RECOVERY - puppet last run on mw2013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:46:47] RECOVERY - puppet last run on mw2134 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:46:48] RECOVERY - puppet last run on mw2126 is OK Puppet is currently enabled, last run 34 seconds ago with 0 failures [06:47:08] RECOVERY - puppet last run on logstash1002 is OK Puppet is currently enabled, last run 45 seconds ago with 0 failures [06:47:27] RECOVERY - puppet last run on ms-fe2001 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:08] RECOVERY - puppet last run on mw2050 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:48:09] RECOVERY - puppet last run on wtp2015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [06:53:18] (03PS1) 10KartikMistry: Added initial Debian packaging for apertium-eu-es [debs/contenttranslation/apertium-eu-es] - 10https://gerrit.wikimedia.org/r/207038 (https://phabricator.wikimedia.org/T96653) [06:57:46] akosiaris: ping me when around :) [06:58:13] (03CR) 10Yuvipanda: "*bump*" [puppet] - 10https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) (owner: 10coren) [07:10:15] (03CR) 10Muehlenhoff: "Generally looks good to me, but debian/changelog mentions the package has been converted to debhelper compat level 9, while it in fact use" [debs/contenttranslation/apertium-fr-es] - 
10https://gerrit.wikimedia.org/r/195577 (https://phabricator.wikimedia.org/T92252) (owner: 10KartikMistry) [07:11:30] (03CR) 10Aldnonymous: [C: 031] Add abusefilter-modify-restricted right to sysop user group for idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206080 (https://phabricator.wikimedia.org/T96542) (owner: 10Mjbmr) [07:28:02] (03PS1) 10Merlijn van Deen: Extend Exim diamond collector for Tool Labs [puppet] - 10https://gerrit.wikimedia.org/r/207043 (https://phabricator.wikimedia.org/T96898) [07:30:55] (03CR) 10Merlijn van Deen: "Yep, that's what happens if you first test on system A, fix bugs, then on system B, fix other bugs and inadvertently break things for syst" [puppet] - 10https://gerrit.wikimedia.org/r/207043 (https://phabricator.wikimedia.org/T96898) (owner: 10Merlijn van Deen) [07:38:01] (03CR) 10Kenrick95: [C: 031] Add abusefilter-modify-restricted right to sysop user group for idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206080 (https://phabricator.wikimedia.org/T96542) (owner: 10Mjbmr) [08:00:14] bblack, or any op here? [08:02:48] i am going to rename https://meta.wikimedia.org/wiki/Special:CentralAuth/SatuSuro [08:03:08] hundreds of thousands of edits, need OK from ops [08:03:28] (ping: Legoktm, _joe_ apergos) [08:06:02] no tech around? :-/ [08:08:44] (03PS1) 10KartikMistry: Added initial Debian packaging for apertium-es-an [debs/contenttranslation/apertium-es-an] - 10https://gerrit.wikimedia.org/r/207045 (https://phabricator.wikimedia.org/T96651) [08:09:20] :'-( [08:12:11] Steinsplitter, go for it.
you don't really need to ask for our permissions [08:12:34] ok :) [08:13:25] i don't think you can do it without the bigmove right, but correct me if i'm wrong [08:14:07] Jobs to rename SatuSuro to JarrahTree have been queued [08:14:09] :O [08:15:08] Steinsplitter: sorry, I missed your ping [08:16:34] Steinsplitter: it seems like you broke his account [08:16:40] https://meta.wikimedia.org/wiki/Special:CentralAuth/SatuSuro [08:17:05] https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/JarrahTree says in progress [08:17:17] let's wait for it to complete shall we? [08:17:52] * matanya crosses fingers [08:18:10] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: Force all Wikimedia cluster traffic to be over SSL for all users (logged-in and anon) - https://phabricator.wikimedia.org/T49832#1241029 (10Chmarkine) >>! In T49832#1240813, @BBlack wrote: > As I've stated before, personally I'd prefer to do the hard redir... [08:18:25] if we get really stuck we can reassign edits for enwiki via the maintenance script [08:18:33] (03PS1) 10KartikMistry: Added initial Debian package for apertium-es-ast [debs/contenttranslation/apertium-es-ast] - 10https://gerrit.wikimedia.org/r/207046 (https://phabricator.wikimedia.org/T96652) [08:19:22] apergos: i think so. the documentation says to notify/ask sysadmins when renaming accounts with +50000 edits [08:19:25] maybe because of this [08:19:51] well let's give it some time [08:24:06] (03PS1) 10KartikMistry: CX: Add Czech, Greek, Kazakh and Zulu [puppet] - 10https://gerrit.wikimedia.org/r/207047 (https://phabricator.wikimedia.org/T96486) [08:28:07] (03PS1) 10KartikMistry: Enable Content Translation in cs, el, kk and zu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207048 (https://phabricator.wikimedia.org/T96486) [08:28:32] apergos: are the rename jobs working currently?
[08:28:52] I assume so [08:30:49] don't worry Steinsplitter, it will take a while [08:30:59] ok :) thx [08:32:27] PROBLEM - puppet last run on mw2136 is CRITICAL Puppet has 1 failures [08:33:42] if the job fails we should see 'failed' in the progress report [08:34:13] in the meantime the job runners have a lot of things queued like parsoid cache updates, restbase changes, search stuff, etc... [08:34:41] (my how the jobrunner code has changed since I looked at it last!) [08:36:25] really? the jobq dashboard is broken? lucky I'm on one of the hosts then... [08:37:16] 6operations, 6Labs, 10Tool-Labs, 7Monitoring: Add catchall tests for toollabs to catchpoint - https://phabricator.wikimedia.org/T97321#1241053 (10valhallasw) p:5Normal>3Low Maybe Icinga for local 1-min resolution monitoring and Catchpoint for worldwide monitoring at a lower resolution? I think a 5 min... [08:49:17] RECOVERY - puppet last run on mw2136 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [08:52:36] if we are using the default setting for wgUpdateRowsPerJob then we've got 2000 jobs to wait for... oughta check back much later in the ay [08:52:37] day [08:53:54] apergos: every wiki has been processed, except enwiki.
:-/ [08:54:01] well yes [08:54:07] because that's the 100k edits :-D [08:54:12] ok :-D [08:54:54] (03CR) 10KartikMistry: [C: 04-1] "Not to deploy before https://gerrit.wikimedia.org/r/207048" [puppet] - 10https://gerrit.wikimedia.org/r/207047 (https://phabricator.wikimedia.org/T96486) (owner: 10KartikMistry) [08:59:40] (03CR) 10Hashar: [C: 031] integration - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/206981 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [08:59:45] (03CR) 10Hashar: [C: 031] doc - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/206980 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [09:10:35] (03PS1) 10Mobrovac: mobileapps service: Role and module for SCA [puppet] - 10https://gerrit.wikimedia.org/r/207050 (https://phabricator.wikimedia.org/T92627) [09:10:51] (03PS1) 10Mobrovac: mobileapps service: LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/207051 (https://phabricator.wikimedia.org/T92627) [09:11:36] (03PS1) 10Mobrovac: mobileapps service: Varnish / parsoidcache configuration [puppet] - 10https://gerrit.wikimedia.org/r/207052 (https://phabricator.wikimedia.org/T92627) [09:13:10] hashar: sigh. I missed your last ping on betawiki creation task. [09:14:11] akosiaris: mobileapps service patches ready - https://gerrit.wikimedia.org/r/#/c/207050/ and deps [09:14:18] mobrovac: thanks! [09:14:52] akosiaris: as usual, you need to do lvs ip and dns [09:15:27] heh, yeah that should be easy [09:16:07] cool, thnx [09:17:19] Steinsplitter: finished ok [09:17:59] <3 [09:18:16] apergos: FYI [09:18:21] thanks apergos, MaxSem, matanya :) [09:23:44] (03PS1) 10Alexandros Kosiaris: Rearrange/refactor some check_http commands [puppet] - 10https://gerrit.wikimedia.org/r/207053 [09:25:59] yay [09:29:26] (03CR) 10Hashar: [C: 04-1] "Can you check whether the device parameter can be dropped entirely?" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/204542 (owner: 10Krinkle) [09:30:06] (03PS12) 10Hashar: contint: move zuul_merger_hosts to hiera, use in ferm [puppet] - 10https://gerrit.wikimedia.org/r/201882 (https://phabricator.wikimedia.org/T87519) (owner: 10Dzahn) [09:31:51] (03CR) 10Hashar: [C: 031] contint: move zuul_merger_hosts to hiera, use in ferm [puppet] - 10https://gerrit.wikimedia.org/r/201882 (https://phabricator.wikimedia.org/T87519) (owner: 10Dzahn) [09:31:54] 6operations, 10Mathoid-General-or-Unknown, 6Services: Standardise Mathoid's deployment - https://phabricator.wikimedia.org/T97124#1241087 (10mobrovac) The deploy repository has been [requested](https://www.mediawiki.org/wiki/Git/New_repositories/Requests) [09:37:40] (03CR) 10Alexandros Kosiaris: [C: 032] Rearrange/refactor some check_http commands [puppet] - 10https://gerrit.wikimedia.org/r/207053 (owner: 10Alexandros Kosiaris) [09:39:20] (03CR) 10Hashar: "+1 :)" [puppet] - 10https://gerrit.wikimedia.org/r/207053 (owner: 10Alexandros Kosiaris) [09:42:10] 6operations, 10Mathoid-General-or-Unknown, 6Services: Standardise Mathoid's deployment - https://phabricator.wikimedia.org/T97124#1241093 (10Physikerwelt) Can you add me to the repo? I.e. pushing the tags is not possible with gerrit alone [09:43:47] PROBLEM - Check correctness of the icinga configuration on neon is CRITICAL: Icinga configuration contains errors, please check! [09:45:17] ^ known, that is me [09:47:32] 6operations, 10Mathoid-General-or-Unknown, 6Services: Standardise Mathoid's deployment - https://phabricator.wikimedia.org/T97124#1241096 (10Physikerwelt) @mobrovac I recently tried to build with ppa with a makefile that calls npm install. However this still fails with a out of memory extension... Therefore... [09:48:28] 6operations, 10Mathoid-General-or-Unknown, 6Services: Standardise Mathoid's deployment - https://phabricator.wikimedia.org/T97124#1241097 (10mobrovac) >>! 
In T97124#1241093, @Physikerwelt wrote: > Can you add me to the repo? I.e. pushing the tags is not possible with gerrit alone Which one? :) services/math... [09:50:33] 6operations, 10Mathoid-General-or-Unknown, 6Services: Standardise Mathoid's deployment - https://phabricator.wikimedia.org/T97124#1241098 (10mobrovac) >>! In T97124#1241096, @Physikerwelt wrote: > @mobrovac I recently tried to build with ppa with a makefile that calls npm install. However this still fails wi... [09:56:16] 6operations, 10Mathoid-General-or-Unknown, 6Services: Standardise Mathoid's deployment - https://phabricator.wikimedia.org/T97124#1241104 (10Physikerwelt) Thank you very much for the excellent documentation. This simplifies the process significantly (since I don't have to care about ppa debian repos). Origin... [10:00:03] (03Abandoned) 10Hashar: role::ci::website::labs [puppet] - 10https://gerrit.wikimedia.org/r/173251 (owner: 10Hashar) [10:05:03] RECOVERY - Check correctness of the icinga configuration on neon is OK: Icinga configuration is correct [10:08:57] (03PS2) 10Filippo Giunchedi: statsite: improve restart [puppet] - 10https://gerrit.wikimedia.org/r/206819 [10:17:58] (03CR) 10Hashar: "I tested akosiaris suggestion and it solves the puppet dependency issue !" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/203073 (https://phabricator.wikimedia.org/T95545) (owner: 10Hashar) [10:20:40] (03PS7) 10Hashar: contint: make Jessie slaves package builders [puppet] - 10https://gerrit.wikimedia.org/r/203073 (https://phabricator.wikimedia.org/T95545) [10:20:53] (03PS8) 10Hashar: contint: make Jessie slaves package builders [puppet] - 10https://gerrit.wikimedia.org/r/203073 (https://phabricator.wikimedia.org/T95545) [10:21:45] (03CR) 10Iwan Novirion: [C: 031] "Patch Set 1: Code-Review+1" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206080 (https://phabricator.wikimedia.org/T96542) (owner: 10Mjbmr) [10:21:53] (03CR) 10Hashar: [C: 031] "PS7 adds a system::role" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/203073 (https://phabricator.wikimedia.org/T95545) (owner: 10Hashar) [10:28:45] (03PS1) 10Alexandros Kosiaris: service: Add healthcheck_url parameter, ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/207057 [10:29:02] (03CR) 10Hashar: [V: 032] "I can confirm the patch works just fine now!" [puppet] - 10https://gerrit.wikimedia.org/r/203073 (https://phabricator.wikimedia.org/T95545) (owner: 10Hashar) [10:29:21] (03CR) 10jenkins-bot: [V: 04-1] service: Add healthcheck_url parameter, ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/207057 (owner: 10Alexandros Kosiaris) [10:42:20] 6operations, 7Graphite, 5Patch-For-Review: Counters now only provide rates (multiplied by 1000?) 
- https://phabricator.wikimedia.org/T95703#1241224 (10fgiunchedi) correction, holding this while statsite restart is improved in https://gerrit.wikimedia.org/r/#/c/206819/ [10:44:06] (03PS2) 10Alexandros Kosiaris: service: Add healthcheck_url parameter, ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/207057 [10:44:42] (03CR) 10jenkins-bot: [V: 04-1] service: Add healthcheck_url parameter, ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/207057 (owner: 10Alexandros Kosiaris) [10:50:19] 6operations, 7Graphite, 5Patch-For-Review: revisit what percentiles are calculated by statsite - https://phabricator.wikimedia.org/T88662#1241227 (10fgiunchedi) package uploaded, pending more reliable statsite restart https://gerrit.wikimedia.org/r/#/c/206819/ [11:06:55] 6operations, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Isolation, 7Nodepool: Use systemd for Nodepool - https://phabricator.wikimedia.org/T96867#1241257 (10hashar) From T95003 @joe pointed to our puppet define `base::service_unit` which maybe a good template to add systemd to nodepoo... [11:08:28] (03PS3) 10Alexandros Kosiaris: service: Add healthcheck_url parameter, ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/207057 [11:08:52] 6operations: Make services manageable by systemd (tracking) - https://phabricator.wikimedia.org/T97402#1241261 (10hashar) 3NEW [11:09:49] 6operations: Make services manageable by systemd (tracking) - https://phabricator.wikimedia.org/T97402#1241269 (10hashar) + @MoritzMuehlenhoff from ops who have experience with systemd! 
[11:10:48] 6operations, 5Patch-For-Review: Convert ircecho init script to a systemd unit - https://phabricator.wikimedia.org/T95055#1241275 (10hashar) [11:10:51] 6operations, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Isolation, 7Nodepool: Use systemd for Nodepool - https://phabricator.wikimedia.org/T96867#1241273 (10hashar) [11:10:53] 6operations: Switch ganglia aggregator init stuff to systemd on jessie - https://phabricator.wikimedia.org/T96842#1241274 (10hashar) [11:12:04] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL 7.14% of data above the critical threshold [500.0] [11:12:44] PROBLEM - puppet last run on cp3009 is CRITICAL puppet fail [11:14:44] PROBLEM - puppet last run on ms-fe3002 is CRITICAL Puppet has 1 failures [11:15:48] 6operations: Make services manageable by systemd (tracking) - https://phabricator.wikimedia.org/T97402#1241292 (10mobrovac) Also note that there is base::service_unit define in ops/puppet which lets you pick the init script based on the current distro, which could ease transition. 
[11:23:54] RECOVERY - HTTP 5xx req/min on graphite1001 is OK Less than 1.00% above the threshold [250.0] [11:29:34] RECOVERY - puppet last run on cp3009 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures [11:29:54] RECOVERY - puppet last run on ms-fe3002 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures [11:52:41] 6operations: Make services manageable by systemd (tracking) - https://phabricator.wikimedia.org/T97402#1241341 (10hashar) [11:54:32] (03PS1) 10Faidon Liambotis: smokeping: add codfw [puppet] - 10https://gerrit.wikimedia.org/r/207060 [11:55:05] (03CR) 10Faidon Liambotis: [C: 032 V: 032] smokeping: add codfw [puppet] - 10https://gerrit.wikimedia.org/r/207060 (owner: 10Faidon Liambotis) [12:02:17] 6operations: Java 8 for Jessie - https://phabricator.wikimedia.org/T97406#1241350 (10MoritzMuehlenhoff) 3NEW [12:07:53] 6operations, 10Traffic, 7HTTPS, 5HTTPS-by-default: Force all Wikimedia cluster traffic to be over SSL for all users (logged-in and anon) - https://phabricator.wikimedia.org/T49832#1241360 (10BBlack) My point about realtime here is that when we first turn on a 302 for a wiki or group of wikis, we can see th... [12:13:23] PROBLEM - puppet last run on mw2085 is CRITICAL Puppet has 1 failures [12:13:42] (03PS1) 10Alexandros Kosiaris: url_downloader: Fix typos in networking rules [puppet] - 10https://gerrit.wikimedia.org/r/207061 [12:14:54] 6operations, 10Traffic: Evaluate Apache Traffic Server - https://phabricator.wikimedia.org/T96853#1241361 (10BBlack) p:5Low>3Normal [12:15:15] 6operations, 10Wikidata, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 5Patch-For-Review: Create Wikipedia Konkani - https://phabricator.wikimedia.org/T96468#1241364 (10Springle) >>! In T96468#1222975, @Krenair wrote: > * Sanitarium config (@Coren/@Springle? This needs your confirmation) Sanita... 
[12:18:22] (03CR) 10Alexandros Kosiaris: [C: 032] url_downloader: Fix typos in networking rules [puppet] - 10https://gerrit.wikimedia.org/r/207061 (owner: 10Alexandros Kosiaris) [12:25:16] 6operations: Java 8 for Jessie - https://phabricator.wikimedia.org/T97406#1241365 (10Manybubbles) We'll need Java 8 available Mediawiki-vagrant and Jenkins around the time we start getting Cirrus working with 2.0. 2.0 doesn't have a release date and we don't jump until Elasticsearch has been in 2.0 for a month a... [12:28:33] RECOVERY - puppet last run on mw2085 is OK Puppet is currently enabled, last run 28 seconds ago with 0 failures [12:28:36] (03CR) 10Mobrovac: [C: 04-1] "LGTM, modulo small in-lined comment." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/207057 (owner: 10Alexandros Kosiaris) [12:31:07] (03PS4) 10Nemo bis: Set $wgRateLimits['badcaptcha'] to counter bots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195886 (https://phabricator.wikimedia.org/T92376) [12:34:03] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [12:34:18] !log xtrabackup clone db2018 to db2043 [12:34:27] Logged the message, Master [12:36:53] !log xtrabackup clone db2019 to db2044 [12:36:56] Logged the message, Master [12:38:27] !log xtrabackup clone db2023 to db2045 [12:38:30] Logged the message, Master [12:39:26] springle: You're around? 
[12:39:35] PROBLEM - puppet last run on cp3017 is CRITICAL puppet fail [12:39:55] hoo: evidently ;) [12:40:16] 6operations, 10Graphoid, 6Services, 10service-template-node, and 2 others: Deploy graphoid service into production - https://phabricator.wikimedia.org/T90487#1241425 (10mobrovac) [12:40:34] springle: When doing Wikidata json dumps I noticed that one of the shards was significantly slower, although they have equally distributed data in all means [12:40:46] so I poked at that a bit and found out it's because the others were hitting db1071 [12:40:52] and that was db1049 [12:41:12] Which is way slower, thus slowing down the overall process, for that shard [12:42:04] hoo: db1071 is twice the box db1049 is [12:42:24] 64GB R510 vs 160GB R710 [12:42:34] hoo: also, what dumps are these? [12:42:38] That makes sense, yes [12:43:05] On these boxes we only run one query but that one fairly often [12:43:29] SELECT ... rev_id,rev_content_format,rev_timestamp,page_latest,old_id,old_text,old_flags,epp_entity_id,epp_entity_type FROM `wb_entity_per_page` INNER JOIN `page` ON ((epp_page_id=page_id)) INNER JOIN `revision` ON ((page_latest=rev_id)) INNER JOIN `text` ON ((old_id=rev_text_id)) WHERE (epp_entity_id = '18471985' AND epp_entity_type = 'item') OR (epp_entity_id = '18471986' AND epp_entity_type = 'item') OR ... [12:44:13] We also do that query during normal operation, eg, when fetching things for the API [12:45:00] does it use a particular query group? 'api [12:45:07] 'api' presumably [12:45:11] No, none [12:45:23] That's why it ends up on any of the slaves with a non zero load [12:45:51] it should probably use 'api' or 'dump' [12:46:12] which for s5 ends up on db1045, usually not heavily loaded [12:46:18] mh [12:46:54] Thinking about that, no, not really... we hit that code in one form or another whenever we need to load an entity [12:46:58] either from cache or from ES [12:47:10] ok, fair enough [12:47:25] mh... what about playing with the load? [12:47:33] how so?
[12:47:48] Giving the 64gb slaves 4/5 of the load of db1071 might be too much for them [12:48:08] although changing it could have the primary effect of making db1071 slower, not really the others faster [12:48:12] thing is, 4/5 of the load is mostly simple selects [12:48:50] as this is s5 only, it might just be we need to upgrade s5 slaves. use more R710s [12:49:04] s5 is always a bit special [12:49:50] hoo: what /* comment */ is used for the query ^ ? [12:50:01] yeah... I guess it's mainly the amount of ram that makes db1071 unbeatable for these simple queries [12:50:21] It's from the Wikibase\Lib\Store\Sql\WikiPageEntityMetaDataLookup class [12:50:27] Give me a sec to fetch a link [12:51:24] springle: https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/master/lib/includes/store/sql/WikiPageEntityMetaDataLookup.php [12:51:52] we plan to upgrade the EQIAD R510s fleet in the next FY, once CODFW is done [12:52:08] There's an in-process caching/prefetching layer on top of it where we try to accumulate the infos we need and fetch them at once: https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/master/lib/includes/store/sql/WikiPageEntityMetaDataLookup.php [12:52:14] * https://github.com/wikimedia/mediawiki-extensions-Wikibase/blob/master/lib/includes/store/sql/PrefetchingWikiPageEntityMetaDataAccessor.php [12:52:23] springle: Awesome :) [12:58:04] RECOVERY - puppet last run on cp3017 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [12:58:23] hoo: in the meantime, if you introduce some wikidata wfGetDB group names, we could better control load [12:58:55] Mh... you mean for stuff that touches the wikibase specific tables? [12:59:01] yes [12:59:02] We could do that [13:00:04] aude: Respected human, time to deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150428T1300). Please do the needful. [13:00:30] We have a deploy slot? Wuut?
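The batched lookup quoted earlier ORs one (epp_entity_id, epp_entity_type) pair per entity into a single WHERE clause; roughly, the condition string grows like this (hypothetical IDs, and only an illustration of the shape — the real code builds it in PHP inside WikiPageEntityMetaDataLookup):

```shell
# Hypothetical entity IDs; the prefetching accessor batches many of these into one SELECT.
ids="18471985 18471986"
cond=""
for id in $ids; do
  # Append one OR branch per entity, matching the shape of the logged query
  cond="${cond:+$cond OR }(epp_entity_id = '$id' AND epp_entity_type = 'item')"
done
echo "$cond"
```

Batching like this is what lets PrefetchingWikiPageEntityMetaDataAccessor turn many single-entity lookups into one round trip per slave, which is also why the choice of slave matters so much for dump throughput.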
[13:01:19] Oh, usage tracking [13:01:42] springle: I'll open a ticket in a bit, that's a good idea I think [13:01:55] :) [13:01:55] great [13:02:33] springle: did you see our wbc_entity_usage schema change task? [13:03:19] https://phabricator.wikimedia.org/T95179 [13:03:37] * aude doesn't feel comfortable to do it myself with osc [13:03:53] for wikidatawiki [13:04:16] is that a blocker for you now? [13:04:42] it's not but should be done at some point soonish [13:04:47] before we forget [13:06:57] aude: please link schema changes to T51188, and bump priority [13:07:04] springle: ok [13:09:08] (03PS1) 10Alexandros Kosiaris: Add public LVS network ranges in network.pp [puppet] - 10https://gerrit.wikimedia.org/r/207070 [13:09:33] done [13:09:39] thanks! [13:10:30] a tracking bug seems so ... bugzilla-esque. i suppose we should do something more phab [13:14:00] Workboard! [13:14:01] (03CR) 10coren: [C: 031] "This is ready to deploy for cleanup." [puppet] - 10https://gerrit.wikimedia.org/r/203864 (https://phabricator.wikimedia.org/T95555) (owner: 10coren) [13:15:31] 6operations, 10Traffic: Reboot caches for kernel 3.19.3 globally - https://phabricator.wikimedia.org/T96854#1241487 (10BBlack) Had a chat with @MoritzMuehlenhoff about the kernel issues. He's convinced me we should stick with the 3.19 series for the foreseeable future (with an eye towards eventually adopting... 
[13:15:32] springle: convert to a project :) See https://www.mediawiki.org/wiki/Phabricator/Project_management/Tracking_tasks [13:15:55] (03CR) 10Giuseppe Lavagetto: [C: 031] "Apart from the small comment Marko made, this LGTM" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/207057 (owner: 10Alexandros Kosiaris) [13:16:17] 6operations, 10Traffic: Build a non-trunk 3.19 kernel for jessie - https://phabricator.wikimedia.org/T97411#1241490 (10BBlack) 3NEW a:3MoritzMuehlenhoff [13:17:07] (03CR) 10Giuseppe Lavagetto: [C: 031] "I think network.pp is one of the few puppet files I'd preserve in the long run. And even if we won't, for now it needs to be complete." [puppet] - 10https://gerrit.wikimedia.org/r/207070 (owner: 10Alexandros Kosiaris) [13:17:39] hmmm [13:17:48] * aude needs to update my deployment key [13:20:11] (03CR) 10Alexandros Kosiaris: service: Add healthcheck_url parameter, ferm::service (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/207057 (owner: 10Alexandros Kosiaris) [13:22:24] (03PS4) 10Alexandros Kosiaris: service: Add healthcheck_url parameter, ferm::service [puppet] - 10https://gerrit.wikimedia.org/r/207057 [13:27:34] (03CR) 10BBlack: [C: 031] Add public LVS network ranges in network.pp [puppet] - 10https://gerrit.wikimedia.org/r/207070 (owner: 10Alexandros Kosiaris) [13:30:03] andre__: sounds like a plan [13:31:56] (03CR) 10Alexandros Kosiaris: [C: 032] "Comments addressed, merging." [puppet] - 10https://gerrit.wikimedia.org/r/207057 (owner: 10Alexandros Kosiaris) [13:33:02] (03CR) 10Alexandros Kosiaris: [C: 032] Add public LVS network ranges in network.pp [puppet] - 10https://gerrit.wikimedia.org/r/207070 (owner: 10Alexandros Kosiaris) [13:34:34] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. 
[13:37:12] (03PS1) 10Aude: Update my (aude's) ssh key [puppet] - 10https://gerrit.wikimedia.org/r/207079 [13:39:15] _joe_: can you (or anyone) please review https://gerrit.wikimedia.org/r/#/c/207079/ [13:39:46] maybe hoo can +1 since he can verify situation :) [13:40:31] PROBLEM - puppet last run on palladium is CRITICAL puppet fail [13:40:49] (03CR) 10Hoo man: [C: 031] "I can confirm that Katie left her Macbook in Germany and bought a new ultrabook on Saturday" [puppet] - 10https://gerrit.wikimedia.org/r/207079 (owner: 10Aude) [13:40:54] thanks :) [13:41:50] aude: Got your Ubuntu all set up now? Is it going well with the drivers? [13:41:51] RECOVERY - puppet last run on palladium is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures [13:42:01] hoo: all good so far [13:42:14] but things like forgot to install composer before going on the airplane [13:42:20] RECOVERY - graphoid on sca1001 is OK: HTTP OK: HTTP/1.1 200 OK - 879 bytes in 0.016 second response time [13:42:29] uh :S [13:42:37] didn't do much on the airplane [13:43:08] I don't think I know anyone who actually manages to be productive on a plane [13:43:17] if i had things installed [13:43:26] * aude checked out mediawiki + submodules [13:43:31] that gives me the build [13:43:31] I'm already happy if I can fall asleep :P [13:43:49] which is not exactly useful unless i composer install --prefer-source [13:43:53] Will you branch today, or shall I later on? [13:43:57] i can [13:43:58] yeah, you need all git repos [13:44:00] ok :) [13:44:15] <_joe_> aude: I'll merge in a few [13:44:19] and don't have phpunit yet :o [13:44:21] _joe_: thanks [13:44:28] <_joe_> aude: did you change your key on gerit too? 
[13:44:30] <_joe_> *r [13:44:33] _joe_: i did [13:45:00] PROBLEM - Host analytics1015 is DOWN: PING CRITICAL - Packet loss = 100% [13:48:32] (03PS2) 10Giuseppe Lavagetto: Update my (aude's) ssh key [puppet] - 10https://gerrit.wikimedia.org/r/207079 (owner: 10Aude) [13:48:53] (03CR) 10Giuseppe Lavagetto: [C: 032] Update my (aude's) ssh key [puppet] - 10https://gerrit.wikimedia.org/r/207079 (owner: 10Aude) [13:48:58] (03PS3) 10Ottomata: Add alerts for missing hours in pagecounts_all_sites and pagecounts_raw [puppet] - 10https://gerrit.wikimedia.org/r/205067 (owner: 10QChris) [13:49:02] 1015!? [13:49:13] ha [13:49:18] (03CR) 10Ottomata: [C: 032 V: 032] Add alerts for missing hours in pagecounts_all_sites and pagecounts_raw [puppet] - 10https://gerrit.wikimedia.org/r/205067 (owner: 10QChris) [13:49:19] thanks [13:50:13] <_joe_> aude: in about 20 mins that's gonna be propagated [13:50:17] ok [13:50:24] <_joe_> ottomata: analytics1015 is you? [13:50:40] * aude will do what i can with deploying usage tracking until swat [13:50:46] not on purpose, no, but the older dells have been individually crashing over the last 2 weeks, not yet sure why [13:50:55] 3 in the last 24 hours, so it is more than usual now [13:51:04] mostly adding tables and populating them and actually enabling stuff later [13:51:04] most of them are just fine after reboot [13:51:14] !log powercycling analytics1015 after crash [13:51:19] Logged the message, Master [13:51:21] (starting with the subscription tracking part on wikidata) [13:54:30] RECOVERY - Host analytics1015 is UP: PING OK - Packet loss = 0%, RTA = 2.23 ms [13:58:40] PROBLEM - puppet last run on mw2020 is CRITICAL Puppet has 1 failures [14:16:33] RECOVERY - puppet last run on mw2020 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [14:23:43] PROBLEM - puppet last run on snapshot1004 is CRITICAL puppet fail [14:27:24] Krinkle: Going to prepare the extension-update-in-core changes for your SWAT patches this morning?
[14:28:00] anomie: Aye, I thought SWAT usually does that. No problem, will do. [14:28:09] 6operations, 10Graphoid, 6Services, 10service-template-node, and 2 others: Deploy graphoid service into production - https://phabricator.wikimedia.org/T90487#1241635 (10Yurik) Diagram has been added to the https://www.mediawiki.org/w/index.php?title=Extension:Graph#Graphoid_Service [14:28:16] anomie: They'd have to be merged first, right? [14:28:21] Krinkle: Yes [14:28:42] But I can't merge without intent to deploy right-after per policy. [14:28:46] Hence I thought SWAT usually does this [14:29:23] Thx for the heads up :) [14:29:24] I don't know what the evening window does, but https://wikitech.wikimedia.org/wiki/SWAT_deploys says "For extension fixes, the SWAT team prefers that the requestor submits a gerrit change to core that "bumps" the extension submodule to incorporate the fix (see How to deploy code#Updating the submodule)." [14:31:00] As far as I know, the "intent to deploy" mainly applies to the operations/mediawiki-config repo, and to a lesser extent mediawiki/core, while the extension branches are mostly not. That's a good point though. [14:31:56] anomie: I noticed recently that a fair number of people seem to use very strange ways of updating extensions, resulting in dirty git on tin all the time. I'd like to double check if I'm up to date with current practices. [14:32:40] E.g. I see a few times that people cherry-pick master commits to a wmf branch and then deploy that (presumably using git-pull), thus making the reference no longer in sync. [14:33:12] And in those same cases also the wmf branch would no longer include the "Create wmf branch commit" suggesting the user either force pushed, or that the branch was mal-created. [14:34:27] https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Case_1b:_extension_changes should be the correct process.
If someone's updating the extension-reference to the master branch, or checking something out in the submodule without updating the extension-reference in core, they're doing it wrong. [14:34:52] (03PS2) 10Mobrovac: mobileapps service: Role and module for SCA [puppet] - 10https://gerrit.wikimedia.org/r/207050 (https://phabricator.wikimedia.org/T92627) [14:35:13] RECOVERY - puppet last run on snapshot1004 is OK Puppet is currently enabled, last run 51 seconds ago with 0 failures [14:35:38] (03CR) 10Alexandros Kosiaris: [C: 032] Assign LVS IPs to the graphoid service [dns] - 10https://gerrit.wikimedia.org/r/205856 (https://phabricator.wikimedia.org/T90487) (owner: 10Alexandros Kosiaris) [14:38:02] (03CR) 10Alexandros Kosiaris: [C: 032] Graphoid: LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/206106 (https://phabricator.wikimedia.org/T90487) (owner: 10Mobrovac) [14:39:00] (03PS2) 10Mobrovac: mobileapps service: LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/207051 (https://phabricator.wikimedia.org/T92627) [14:39:06] (03CR) 10jenkins-bot: [V: 04-1] mobileapps service: LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/207051 (https://phabricator.wikimedia.org/T92627) (owner: 10Mobrovac) [14:39:35] apergos: most of the jobs you mentioned earlier are actually in separate queues, so won't delay other classes of jobs [14:40:56] without going into specifics of the queues I just wanted to give the general sense of what was going on (that all 2000 jobs weren't going to be queued and run instantly) [14:41:25] because there are so few rename jobs queued it's rare to see one via ps on a job runner [14:41:41] (03PS1) 10Faidon Liambotis: smokeping: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/207117 [14:41:45] so all I culd really o was give them a sense of the churn... [14:41:49] *could do [14:42:08] where can I look for the specific queues btw?
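The "Case 1b" bump being described boils down to moving the gitlink that core records for the extension, in a reviewable core commit, rather than hand-editing the checkout. A minimal local sketch with throwaway repositories (extensions/Ext, the demo identity, and the repo layout are all placeholders; this mirrors the wikitech workflow rather than reproducing it):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"
# Shorthand carrying identity + file-protocol config so the sketch runs anywhere
G="git -c user.name=demo -c user.email=demo@example.org -c protocol.file.allow=always -c init.defaultBranch=master"

# Stand-in for a deployed extension repo (e.g. extensions/CentralAuth)
$G init -q ext
(cd ext && $G commit -q --allow-empty -m 'initial')

# Stand-in for the mediawiki/core wmf branch that pins the extension as a submodule
$G init -q core
(cd core && $G commit -q --allow-empty -m 'initial' \
  && $G submodule add -q "$tmp/ext" extensions/Ext \
  && $G commit -q -m 'Add extension submodule')

# A fix lands on the extension's branch...
(cd ext && $G commit -q --allow-empty -m 'SWAT fix')

# ...so bump the pointer recorded in core instead of hacking the work tree
(cd core/extensions/Ext && $G pull -q origin master)
(cd core && $G add extensions/Ext && $G commit -q -m 'Update extensions/Ext for SWAT')

# core's tree now records the extension's new HEAD commit (a gitlink entry)
git -C core ls-tree HEAD extensions/Ext
```

Because the pointer moves in a core commit, the deployment host's checkout stays in sync with gerrit, avoiding the "dirty git on tin" state described above.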
[14:42:11] (03PS2) 10Faidon Liambotis: smokeping: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/207117 [14:42:34] (03CR) 10Faidon Liambotis: [C: 032 V: 032] smokeping: brown paper bag fix [puppet] - 10https://gerrit.wikimedia.org/r/207117 (owner: 10Faidon Liambotis) [14:45:10] apergos: one place is modules/mediawiki/templates/jobrunner/jobrunner.conf.erb [14:45:19] ah the conf file [14:45:48] thanks, the code is much different than the last time I stuck my nose in it [14:45:54] that's not exhaustive, but lists all the jobs with separate queues [14:46:01] works for me [14:46:06] the remaining ones end up in the default queue afaik [14:46:11] right [14:46:18] <^d> Yeah, if they're not given a dedicated runner [14:46:25] <^d> The default runners will grab them [14:46:46] <^d> apergos: Also, `mwscript showJobs.php --wiki=foowiki --group` is a useful MW script on terbium/tin [14:46:50] ah [14:47:04] I was about to ask what cmd line tools we have around in order to not dig around in the redis db [14:47:11] which would have been my next step [14:47:19] <^d> That'll give you a list of currently queued and claimed jobs ^ [14:47:22] perfect [14:47:27] <^d> Or counts, rather [14:47:31] <^d> List would be big :p [14:47:36] yes well [14:47:45] per queue? [14:47:51] <^d> Yep, hence --group [14:47:56] nice indeed [14:48:18] <^d> Without it it'll just give you the total # of queued jobs across all queues iirc [14:48:36] not as useful ("large!" "gee thanks") [14:49:08] <^d> hehe [14:49:11] <^d> $ mwscript showJobs.php --wiki=enwiki [14:49:11] <^d> 8946162 [14:49:30] remember when we used to think that was too many? :-D [14:49:41] <^d> Almost 9mn jobs? Heh yeah [14:50:36] gwicke, kart_, Krinkle: Ping for SWAT in 10 minutes.
[14:50:45] (03PS3) 10Mobrovac: mobileapps service: Role and module for SCA [puppet] - 10https://gerrit.wikimedia.org/r/207050 (https://phabricator.wikimedia.org/T92627) [14:50:58] pong [14:51:20] !log restarted pybal on lvs1006 [14:51:27] Logged the message, Master [14:51:29] should really get the gdash stuff working again [14:52:36] <^d> graphs for graphite would be nice :) [14:53:16] (03CR) 10ArielGlenn: "I was expecting the compression piece and recopying and all that to be done by you but I can do it in the python piece if preferred." [dumps] (ariel) - 10https://gerrit.wikimedia.org/r/206849 (owner: 10ArielGlenn) [14:53:23] (03PS3) 10Mobrovac: mobileapps service: LVS configuration [puppet] - 10https://gerrit.wikimedia.org/r/207051 (https://phabricator.wikimedia.org/T92627) [14:55:44] (03CR) 10Alexandros Kosiaris: [C: 032] contint: make Jessie slaves package builders [puppet] - 10https://gerrit.wikimedia.org/r/203073 (https://phabricator.wikimedia.org/T95545) (owner: 10Hashar) [14:58:20] !log restart pybal on lvs1003 [14:58:24] Logged the message, Master [14:58:46] (03PS2) 10BBlack: iegreview - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/206983 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [14:58:48] (03PS2) 10Mobrovac: mobileapps service: Varnish / parsoidcache configuration [puppet] - 10https://gerrit.wikimedia.org/r/207052 (https://phabricator.wikimedia.org/T92627) [14:59:02] anyone about to merge more ops/puppet stuff in the next couple of minutes?
I need to shove through like 8 patches without going through rebase-hell [14:59:33] RECOVERY - Host analytics1016 is UP: PING OK - Packet loss = 0%, RTA = 1.28 ms [14:59:34] RECOVERY - RAID on analytics1016 is OK no disks configured for RAID [14:59:43] RECOVERY - Disk space on analytics1016 is OK: DISK OK [14:59:43] RECOVERY - configured eth on analytics1016 is OK - interfaces up [14:59:54] RECOVERY - SSH on analytics1016 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2 (protocol 2.0) [14:59:54] RECOVERY - Hadoop DataNode on analytics1016 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode [14:59:57] (03PS3) 10BBlack: iegreview - Raise HSTS max-age to 1 year and add "always" [puppet] - 10https://gerrit.wikimedia.org/r/206983 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [15:00:02] RECOVERY - dhclient process on analytics1016 is OK: PROCS OK: 0 processes with command name dhclient [15:00:04] manybubbles, anomie, ^d, thcipriani, marktraceur, anomie, gwicke, Krinkle, kart_: Respected human, time to deploy Morning SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150428T1500). Please do the needful. [15:00:10] * anomie begins SWAT [15:00:12] RECOVERY - Hadoop NodeManager on analytics1016 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [15:00:22] RECOVERY - salt-minion processes on analytics1016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:00:25] Krinkle: You're first, since no one else responded to the ping. [15:00:35] anomie: here. [15:00:55] anomie: Okay [15:01:04] anomie: should I merge extension update patch so it will be quicker while doing deployment?
[15:01:04] * bblack mourns the loss of grrrit-wm [15:01:23] I'll wait for after SWAT [15:01:34] kart_: Yours is a config change [15:01:43] RECOVERY - DPKG on analytics1016 is OK: All packages OK [15:01:49] Oh, last-minute additions. [15:01:53] anomie: and CX extension updates. [15:01:53] PROBLEM - NTP on analytics1016 is CRITICAL: NTP CRITICAL: Offset unknown [15:02:04] kart_: Yes, and prepare the submodule updates for mediawiki/core too. [15:02:11] 6operations, 10ops-eqiad: analytics1016 down - https://phabricator.wikimedia.org/T97349#1241703 (10Cmjohnson) Analytics1016 was hung because of the same power firmware error that we just had on analytics1020. I continued with f1 and the OS is up. I need to run a firmware update and hopefully that will clear... [15:02:25] anomie: that's done. [15:02:30] kart_: Oh, you're good then. [15:02:50] anomie: should I merge them or you'll be doing as part of SWAT? [15:03:04] kart_: I'll merge the core changes as part of SWAT [15:03:13] RECOVERY - puppet last run on analytics1016 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [15:03:22] anomie: ok. thanks! [15:03:34] RECOVERY - NTP on analytics1016 is OK: NTP OK: Offset -0.003293156624 secs [15:05:00] kart_: BTW, you really should be following https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Case_1b:_extension_changes, not https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Case_1c:_extension_update [15:06:17] 6operations, 10Graphoid, 6Services, 10service-template-node, and 2 others: Deploy graphoid service into production - https://phabricator.wikimedia.org/T90487#1241713 (10mobrovac) >>! In T90487#1241635, @Yurik wrote: > Diagram has been added to the https://www.mediawiki.org/w/index.php?title=Extension:Graph... [15:08:00] anomie: right. But, we wanted two changes. So, do I need to do cherry-pick two times? [15:08:19] anomie: I should take care from next time then. [15:08:46] kart_: Yeah, you'd cherry-pick and merge both. 
Then you could do one core update for the both, especially if they're related. Just updating the extension to master might clobber any other changes someone else might have done. [15:09:22] 6operations, 10ops-eqiad: db1060 raid degraded - https://phabricator.wikimedia.org/T96471#1241732 (10Cmjohnson) Disk Request Sent. Congratulations: Work Order SR910475910 was successfully submitted. [15:10:23] anomie: Noted. [15:10:44] anomie: In this case, we've verified changes, so all is good. [15:11:09] !log anomie Synchronized php-1.26wmf3/extensions/CentralAuth/: SWAT: CentralAuth: Fix missing "&" in onMakeGlobalVariablesScript signature [[gerrit:207021]] (duration: 00m 29s) [15:11:16] Logged the message, Master [15:11:19] kart_: Or you might wind up with someone else clobbering your change because they updated the wmf branch [15:11:20] Krinkle: ^ Test please [15:12:17] anomie: confirmed on mw.org [15:12:17] anomie: ah. got that :/ [15:13:02] !log anomie Synchronized php-1.26wmf2/extensions/CentralAuth/: SWAT: CentralAuth: Fix missing "&" in onMakeGlobalVariablesScript signature [[gerrit:207023]] (duration: 00m 24s) [15:13:03] Krinkle: ^ Test please (for wmf2 now) [15:13:07] Logged the message, Master [15:13:10] cmjohnson1: yt? [15:13:14] yes [15:13:18] " Please do not add back yet until I do the update." [15:13:20] whathca mean? 
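The cherry-pick-vs-reset point anomie makes above can be shown with a toy repository. This is a self-contained sketch; the repo, branch, and file names are made up for illustration and are not the real MediaWiki layout:

```shell
# Toy demo: cherry-picking a fix onto a "deployed" branch keeps patches that
# exist only on that branch, whereas resetting the branch to master wholesale
# would clobber them. All names here are illustrative.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q ext && cd ext
git config user.email demo@example.org
git config user.name Demo

echo base > code.txt
git add code.txt && git commit -qm 'base'
git branch wmf2                        # the "deployed" branch forks here

echo fix >> code.txt                   # the fix lands on the main branch first
git add code.txt && git commit -qm 'fix'
FIX=$(git rev-parse HEAD)

git checkout -q wmf2
echo hack > hack.txt                   # a patch that exists ONLY on wmf2
git add hack.txt && git commit -qm 'wmf-only hack'

git cherry-pick "$FIX"                 # bring over just the fix...
test -f hack.txt                       # ...the wmf-only patch survives
grep -q fix code.txt                   # ...and the fix is applied
```

Updating the extension submodule straight to master would instead replace the deployed branch's state wholesale, which is exactly the clobbering anomie warns about.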
[15:13:46] Tim-away: You merged https://gerrit.wikimedia.org/r/#/c/207066/1 but never deployed it :/ [15:14:00] anomie: Confirmed on metawiki [15:14:11] anomie: You're next [15:14:41] > Uncaught TypeError: Converting circular structure to JSON [15:14:45] :D [15:14:49] (j/k) [15:15:10] ottomata: I think he means he hasn't fixed the firmware issue yet, and wants to do so before it goes back into service [15:15:36] (because that presumably requires a host outage) [15:15:47] ottomata: I mean that I will be doing more testing on it this week so it should stay in maintenance and out of production [15:16:07] ok, welp, if it comes back online, the hadoop daemons start up and rejoin the cluster, so its got jobs running on it already! [15:16:07] :) [15:17:01] okay [15:17:10] i will just ping you when I need to take it down again [15:17:12] should we stop them and take it offline? or should we just wait til you are ready [15:17:13] yeah, let's do that [15:17:17] cmjohnson1: while we're on this subject: a bunch of our previous-gen R620 cache hardware now tends to spam: [15:17:19] it should be easy enough to take offline when we are ready [15:17:20] [2419299.339753] CPU12: Package temperature above threshold, cpu clock throttled (total events = 165539) [15:17:23] [2419299.340631] CPU12: Core temperature/speed normal [15:17:27] under jessie. related to "power firmware" fixup? 
[15:17:36] hasn't caused a pragmatic issue, yet [15:18:05] okay, I will pull some info off logs and get dell to send me a f/w update [15:18:27] *will require reboots [15:19:23] k [15:21:03] salt -G 'cluster:cache_*' cmd.run 'dmesg|grep throttled|head' [15:21:15] ^ on palladium, that's a good finder of cache nodes with the msgs [15:23:42] !log anomie Synchronized php-1.26wmf3/includes/api/ApiQuery.php: SWAT: API: Remove metadata keys from indexpageids output [[gerrit:206861]] (duration: 00m 17s) [15:23:43] anomie: ^ Test please [15:23:46] Logged the message, Master [15:23:50] anomie: Works [15:23:53] kart_: You're up [15:24:00] cool. [15:24:03] kart_: Config change first? [15:24:06] yep [15:24:19] (03CR) 10Anomie: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207048 (https://phabricator.wikimedia.org/T96486) (owner: 10KartikMistry) [15:26:39] (03CR) 10KartikMistry: [C: 031] "This should go now!" [puppet] - 10https://gerrit.wikimedia.org/r/207047 (https://phabricator.wikimedia.org/T96486) (owner: 10KartikMistry) [15:26:59] akosiaris: godog can you merge https://gerrit.wikimedia.org/r/#/c/207047/ please? :) [15:30:23] (03Merged) 10jenkins-bot: Enable Content Translation in cs, el, kk and zu [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207048 (https://phabricator.wikimedia.org/T96486) (owner: 10KartikMistry) [15:30:55] !log anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable Content Translation in cs, el, kk and zu [[gerrit:207048]] (duration: 00m 21s) [15:30:55] kart_: ^ Test please [15:31:09] sure [15:31:37] Logged the message, Master [15:33:39] anomie: still not there. Refreshing.. [15:34:50] kart_: Oh, my fault. [15:35:07] * anomie needs to stop forgetting to actually pull the change... [15:35:13] !log anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: Enable Content Translation in cs, el, kk and zu [[gerrit:207048]] (duration: 00m 27s) [15:35:14] kart_: ^ Try now [15:35:31] anomie: ok. we're good! 
[15:35:36] anomie: thanks! [15:35:50] kart_: Don't go away yet, your extension updates are next [15:36:20] yes yes. [15:36:22] :) [15:41:33] (03CR) 10Thcipriani: [C: 032] Make scap localization cache build $TMPDIR aware [tools/scap] - 10https://gerrit.wikimedia.org/r/206856 (https://phabricator.wikimedia.org/T97257) (owner: 10Thcipriani) [15:42:17] Any opsen can merge change please: https://gerrit.wikimedia.org/r/#/c/207047/ [15:42:38] akosiaris: godog ^^ [15:44:17] (03Merged) 10jenkins-bot: Make scap localization cache build $TMPDIR aware [tools/scap] - 10https://gerrit.wikimedia.org/r/206856 (https://phabricator.wikimedia.org/T97257) (owner: 10Thcipriani) [15:45:16] !log anomie Synchronized php-1.26wmf3/extensions/ContentTranslation: SWAT: Update ContentTranslation [[gerrit:207098]] (duration: 00m 46s) [15:45:17] kart_: ^ Test please [15:45:25] Logged the message, Master [15:47:44] anomie: looks fine. [15:48:49] Grr. Someone screwed things up on the wmf2 branch extensions/ContentTranslation/... [15:49:19] anomie: Yuk. Still messed up? [15:49:30] anomie: there were conflict last time. [15:50:00] kart_: Looks like your updating to master came back to bite you. Let me see what exactly the conflict is. [15:50:28] :/ [15:50:47] anomie: I think it was due to cherry-pick and then updating to master. [15:51:12] kart_: You're conflicting with https://gerrit.wikimedia.org/r/#/c/204522/ that was deployed to the wmf2 branch. [15:51:56] kart_: For some reason the corresponding master change (https://gerrit.wikimedia.org/r/#/c/204520/) was abandoned [15:52:09] Nikerabbit: ^ Any idea what the deal is there? [15:52:26] anomie: it was later merged as separate patch. [15:52:32] (03CR) 10John F. Lewis: [C: 031] add policy.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/206972 (https://phabricator.wikimedia.org/T97329) (owner: 10Dzahn) [15:52:43] anomie: so, it isn't lost. but, that created issue. 
[15:52:45] anomie: yes it was a hack, we wanted to merge prettier code to master with some more time [15:52:54] (03CR) 10John F. Lewis: [C: 031] varnish: add misc-web config for policy.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/206974 (https://phabricator.wikimedia.org/T97329) (owner: 10Dzahn) [15:53:05] Nikerabbit: So is it safe to remove that from wmf2 while updating it to master? [15:53:24] anomie: yep [15:53:28] anomie: yes [15:53:43] (03CR) 10John F. Lewis: [C: 031] policy.wm.org: minimal module/role for microsite [puppet] - 10https://gerrit.wikimedia.org/r/206978 (https://phabricator.wikimedia.org/T97329) (owner: 10Dzahn) [15:55:10] !log anomie Synchronized php-1.26wmf2/extensions/ContentTranslation: SWAT: Update ContentTranslation [[gerrit:207092]] (duration: 00m 58s) [15:55:12] kart_: ^ Test please [15:55:16] Logged the message, Master [15:55:36] gwicke: Here yet for SWAT? [15:56:33] anomie: cool. All good. [15:56:40] anomie: Thanks! [15:56:45] kart_: You're welcome [15:57:09] anomie: I'll take care about cherry-pick v/s master for SWAT onwards. [15:57:22] kart_: Good [15:57:23] ! [15:57:53] Now I need someone to merge puppet patch :/ [15:58:03] * anomie declares SWAT closed [16:00:42] 6operations, 10Analytics-EventLogging, 6Analytics-Kanban: EventLogging query strings are truncated to 1014 bytes by ?(varnishncsa? or udp packet size?) - https://phabricator.wikimedia.org/T91347#1241891 (10Tgr) 5declined>3Open Sending a warning message when the logging call fails does not fix the issue o... [16:01:40] (03PS1) 10Giuseppe Lavagetto: hiera/nuyaml: remove dynamic lookups [puppet] - 10https://gerrit.wikimedia.org/r/207127 [16:01:42] (03PS1) 10Giuseppe Lavagetto: hiera: Add a proxy backend [puppet] - 10https://gerrit.wikimedia.org/r/207128 [16:01:44] (03PS1) 10Giuseppe Lavagetto: hiera: use the proxy backend, rationalize the hierarchy [puppet] - 10https://gerrit.wikimedia.org/r/207129 [16:02:21] anomie: Argh, 206319 didn't go out? 
[16:02:34] (03PS1) 10KartikMistry: Added initial Debian package for apertium-oc-ca [debs/contenttranslation/apertium-oc-ca] - 10https://gerrit.wikimedia.org/r/207130 (https://phabricator.wikimedia.org/T96655) [16:02:51] James_F: No, gwicke didn't respond to pings during the SWAT window, and no one else stepped up either. Sorry. [16:03:13] No activity on gerrit and no ping on IRC means people don't know you're asking for them. [16:03:21] Oh well. [16:03:34] * James_F reschedules for this afternoon instead, and Ops can shout at us some more. [16:03:43] PROBLEM - puppet last run on snapshot1004 is CRITICAL puppet fail [16:04:43] * _joe_ shouts at James_F [16:04:46] James_F: If more people want to be pinged, they should add their names to the appropriate line on the Deployments page. I can only go by who is listed there. [16:05:30] (03PS5) 10coren: WIP: Proper labs_storage class [puppet] - 10https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) [16:05:34] anomie: Absolutely. [16:05:52] PROBLEM - puppet last run on mw2183 is CRITICAL puppet fail [16:05:56] (03CR) 10coren: [C: 04-1] "wip" [puppet] - 10https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) (owner: 10coren) [16:13:20] ottomata: are you around? [16:13:50] yes, in a meeting, out in a few mins [16:14:07] or, right now! [16:14:09] hi kart_what's up? 
[16:15:02] ottomata: can you merge https://gerrit.wikimedia.org/r/#/c/207047/ [16:15:24] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1241948 (10JohnLewis) [16:15:26] ottomata: simple change :) (we need to do announcement, so it is really needed) [16:15:46] (03PS2) 10Ottomata: CX: Add Czech, Greek, Kazakh and Zulu [puppet] - 10https://gerrit.wikimedia.org/r/207047 (https://phabricator.wikimedia.org/T96486) (owner: 10KartikMistry) [16:15:46] k [16:15:54] (03CR) 10Ottomata: [C: 032 V: 032] CX: Add Czech, Greek, Kazakh and Zulu [puppet] - 10https://gerrit.wikimedia.org/r/207047 (https://phabricator.wikimedia.org/T96486) (owner: 10KartikMistry) [16:16:07] done [16:16:59] ottomata: cool. Thanks! [16:21:21] (03PS1) 10KartikMistry: Added initial Debian package for apertium-oc-es [debs/contenttranslation/apertium-oc-es] - 10https://gerrit.wikimedia.org/r/207131 (https://phabricator.wikimedia.org/T96655) [16:24:03] RECOVERY - puppet last run on mw2183 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [16:25:13] RECOVERY - puppet last run on snapshot1004 is OK Puppet is currently enabled, last run 41 seconds ago with 0 failures [16:28:09] 7Blocked-on-Operations, 6operations, 10Continuous-Integration-Infrastructure, 6Release-Engineering, and 2 others: Jenkins: Re-enable lint checks for Apache config in operations-puppet - https://phabricator.wikimedia.org/T72068#1242001 (10greg) 5Open>3stalled [16:30:18] (03PS1) 10John F. 
Lewis: Add ebernhardson to researchers group [puppet] - 10https://gerrit.wikimedia.org/r/207133 (https://phabricator.wikimedia.org/T97332) [16:31:53] (03CR) 10Legoktm: mediawiki: Add test to verify redirects.conf has been regenerated from redirects.dat (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/204994 (https://phabricator.wikimedia.org/T72068) (owner: 10Legoktm) [16:36:29] 6operations, 6Phabricator: have any task put into ops-access-requests automatically generate an ops-access-review task - https://phabricator.wikimedia.org/T87467#1242033 (10mmodell) Doesn't phabricator always assume the author has access? maybe I can work around that. [16:37:04] (03CR) 10Filippo Giunchedi: [C: 031] Remove sampling of api.log (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206865 (https://phabricator.wikimedia.org/T88393) (owner: 10Anomie) [16:43:00] (03PS1) 10John F. Lewis: remove production tungsten dns [dns] - 10https://gerrit.wikimedia.org/r/207136 (https://phabricator.wikimedia.org/T97274) [16:43:10] (03CR) 10jenkins-bot: [V: 04-1] remove production tungsten dns [dns] - 10https://gerrit.wikimedia.org/r/207136 (https://phabricator.wikimedia.org/T97274) (owner: 10John F. Lewis) [16:43:13] (03PS1) 10John F. Lewis: reclaim tungsten [puppet] - 10https://gerrit.wikimedia.org/r/207137 (https://phabricator.wikimedia.org/T97274) [16:43:33] (03PS2) 10John F. Lewis: remove production tungsten dns [dns] - 10https://gerrit.wikimedia.org/r/207136 (https://phabricator.wikimedia.org/T97274) [16:48:48] 6operations, 10Wikimedia-Logstash, 5Patch-For-Review: Rack and Setup (3) Logstash Servers - https://phabricator.wikimedia.org/T96692#1242065 (10bd808) >>! In T96692#1232167, @bd808 wrote: > * - logstash: Remove redis input > * 6operations, 10Wikimedia-Logstash, 5Patch-For-Review: Rack and Setup (3) Logstash Servers - https://phabricator.wikimedia.org/T96692#1242074 (10RobH) partman layout guidelines from RT ticket 9199: ~~~~ Yes. 
software RAID1 for the OS partition(s) and software RAID0 for a single data partition for Elasticsear... [16:57:30] (03PS1) 10Chad: Elastic: move auto_create_index into hiera instead of role [puppet] - 10https://gerrit.wikimedia.org/r/207140 [16:58:50] (03PS1) 10RobH: setting up new partman recipe for logstash [puppet] - 10https://gerrit.wikimedia.org/r/207142 [17:01:07] (03CR) 10RobH: [C: 032] setting up new partman recipe for logstash [puppet] - 10https://gerrit.wikimedia.org/r/207142 (owner: 10RobH) [17:01:36] 6operations, 6Phabricator: have any task put into ops-access-requests automatically generate an ops-access-review task - https://phabricator.wikimedia.org/T87467#1242086 (10chasemp) >>! In T87467#1242033, @mmodell wrote: > Doesn't phabricator always assume the author has access? maybe I can work around that.... [17:05:15] @seen HaeB [17:05:15] mutante: Last time I saw HaeB they were quitting the network with reason: Ping timeout: 246 seconds N/A at 4/25/2015 6:37:28 AM (3d10h27m46s ago) [17:05:44] hmm, i doubt the "3d" part [17:06:05] because i got pinged by him last night [17:09:04] (03PS6) 10coren: WIP: Proper labs_storage class [puppet] - 10https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) [17:10:07] (03CR) 10coren: "This still needs substantive testing as much of it has been rewritten" [puppet] - 10https://gerrit.wikimedia.org/r/199267 (https://phabricator.wikimedia.org/T85606) (owner: 10coren) [17:10:25] (03CR) 10Manybubbles: [C: 031] Elastic: move auto_create_index into hiera instead of role [puppet] - 10https://gerrit.wikimedia.org/r/207140 (owner: 10Chad) [17:11:26] 6operations, 6Phabricator, 10Wikimedia-Bugzilla, 7Tracking: Tracking: Remove Bugzilla from production - https://phabricator.wikimedia.org/T95184#1242133 (10Nemo_bis) [17:12:31] (03CR) 10JanZerebecki: [C: 031] iegreview - Raise HSTS max-age to 1 year and add "always" [puppet] - 10https://gerrit.wikimedia.org/r/206983 
(https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [17:19:52] (03CR) 10Andrew Bogott: [C: 031] "seems better." [puppet] - 10https://gerrit.wikimedia.org/r/202788 (https://phabricator.wikimedia.org/T48554) (owner: 10Southparkfan) [17:20:52] (03PS1) 10Ori.livneh: Report code coverage to coveralls [debs/pybal] - 10https://gerrit.wikimedia.org/r/207146 [17:21:15] (03CR) 10Ori.livneh: [C: 032 V: 032] "only touches .travis.yml" [debs/pybal] - 10https://gerrit.wikimedia.org/r/207146 (owner: 10Ori.livneh) [17:21:33] cmjohnson1: seen this one? https://phabricator.wikimedia.org/T97339 [17:21:59] no..give me a minute and I will go check on that [17:22:04] thx [17:22:15] thanks [17:24:53] (03PS1) 10John F. Lewis: add symlink species.org->wikimedia.com [dns] - 10https://gerrit.wikimedia.org/r/207149 (https://phabricator.wikimedia.org/T9495) [17:29:32] (03CR) 10Yuvipanda: [C: 031] "puppet lookes good to me" [puppet] - 10https://gerrit.wikimedia.org/r/203864 (https://phabricator.wikimedia.org/T95555) (owner: 10coren) [17:32:00] (03PS1) 10RobH: setting new logstash nodes to use jessie [puppet] - 10https://gerrit.wikimedia.org/r/207150 [17:34:11] (03CR) 10RobH: [C: 032] setting new logstash nodes to use jessie [puppet] - 10https://gerrit.wikimedia.org/r/207150 (owner: 10RobH) [17:35:24] (03CR) 10coren: [C: 032] Labs: Remove idmap dependency on instances [puppet] - 10https://gerrit.wikimedia.org/r/203864 (https://phabricator.wikimedia.org/T95555) (owner: 10coren) [17:35:36] (03PS2) 10coren: Labs: Remove idmap dependency on instances [puppet] - 10https://gerrit.wikimedia.org/r/203864 (https://phabricator.wikimedia.org/T95555) [17:35:46] 6operations: Reinstall oxygen with Jessie - https://phabricator.wikimedia.org/T97331#1242289 (10Cmjohnson) [17:35:47] 6operations, 10ops-eqiad: Cannot get serial redirection console on oxygen - https://phabricator.wikimedia.org/T97339#1242286 (10Cmjohnson) 5Open>3Resolved a:3Cmjohnson The idrac must have hung. 
I verified the settings, disconnected everything, drained flea power and booted. Serial console is now working.... [17:37:59] (03CR) 10coren: [C: 032] Labs: Remove idmap dependency on instances [puppet] - 10https://gerrit.wikimedia.org/r/203864 (https://phabricator.wikimedia.org/T95555) (owner: 10coren) [17:41:05] (03PS2) 10Dzahn: reclaim tungsten [puppet] - 10https://gerrit.wikimedia.org/r/207137 (https://phabricator.wikimedia.org/T97274) (owner: 10John F. Lewis) [17:43:03] (03CR) 10Dzahn: [C: 032] reclaim tungsten [puppet] - 10https://gerrit.wikimedia.org/r/207137 (https://phabricator.wikimedia.org/T97274) (owner: 10John F. Lewis) [17:45:25] 6operations, 10ops-eqiad, 5Patch-For-Review: reclaim tungsten as spare - https://phabricator.wikimedia.org/T97274#1242384 (10Dzahn) ``` [palladium:~] $ puppet cert clean tungsten.eqiad.wmnet Notice: Revoked certificate with serial 2397 Notice: Removing file Puppet::SSL::Certificate tungsten.eqiad.wmnet at '/... [17:49:08] !log tungsten - revoke puppet cert, delete salt-key, delete from stored configs [17:49:14] Logged the message, Master [17:51:01] cmjohnson1 / mutante - im livehacking shit on carbon [17:51:10] k [17:51:10] if you two go to do an install, it may not work this second [17:51:28] robh: just the opposite, shutting one down [17:51:29] but i can restore in less then two minutes if you need to start one, no problem. [17:51:36] cool, just fyi then =] [17:51:36] fine with me [17:51:41] thx [17:51:54] * robh is trying to get a new partman recipe written and tested [17:52:11] instead i introduced networking errors into the installer =P [17:52:12] kills tungsten [17:52:20] new spare? 
[17:52:24] yes [17:52:32] cool, make task pls (cuz we need onsite to wipe it) [17:52:55] cmjohnson1: oh, speaking of spares, I'm going to detail out some tasks soon for spare server audits in both eqiad and codfw [17:53:03] yes, i'm already acting on a ticket created by godog [17:53:13] there is a ticket [17:53:15] since all past reviews were done by just me, instead we're going to do a review and audit [17:53:18] oh? [17:53:41] (or was ticket reference to the spare wipe?) [17:54:06] no separate ticket for a wipe...just a decom ticket [17:54:54] !log tungsten - disable in icinga. scheduled the longest downtime. shutdown -h now (T97274) [17:55:00] Logged the message, Master [17:55:08] mutante: :D [17:55:11] well, if its going to chris to wipe, it can be one tikcet [17:55:16] cuz cmjohnson1 can just add back to spares [17:55:28] but for codfw, for now, i want two tickets for decom, one to decom (sw side) and one for onsite [17:55:35] (just fyi) [17:55:58] cmjohnson1: so yea, if that wasn't clear, anytime its a decom, you can simply add back to spares (referene the task # in the edit summary) [17:56:05] when you finish the wipe [17:56:13] cool...i've been doing that anyway for the most part [17:56:16] if its not goign back to spares, the wipe ticket will say so [17:56:17] yep [17:56:24] i know, but now you officially know =] [17:56:26] just easier [17:58:47] 6operations, 10ops-eqiad, 5Patch-For-Review: reclaim tungsten as spare - https://phabricator.wikimedia.org/T97274#1242482 (10Dzahn) 10:49 < mutante> !log tungsten - revoke puppet cert, delete salt-key, delete from stored configs 10:55 < mutante> !log tungsten - disable in icinga. scheduled the longest downti... [18:00:05] twentyafterfour, greg-g: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150428T1800). Please do the needful. 
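The per-host cleanup mutante logs above ("revoke puppet cert, delete salt-key, delete from stored configs") corresponds roughly to the standard Puppet 3 and Salt CLIs of the era, run on the puppetmaster/salt master. A sketch, not the exact WMF tooling; the FQDN is taken from the log:

```shell
# Rough sketch of the "reclaim as spare" cleanup logged above.
puppet cert clean tungsten.eqiad.wmnet        # revoke and remove the agent certificate
puppet node deactivate tungsten.eqiad.wmnet   # drop the node from stored configs (PuppetDB)
salt-key -d tungsten.eqiad.wmnet              # delete the accepted salt minion key
```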
[18:01:30] 6operations, 10Deployment-Systems, 10MediaWiki-ResourceLoader: Bad cache stuck due to race condition with scap between different web servers - https://phabricator.wikimedia.org/T47877#1242508 (10Krinkle) [18:04:21] grrrit-wm1: pling [18:04:45] Coren: paravoid (if around) can you look at https://gerrit.wikimedia.org/r/207157 [18:04:57] * YuviPanda is still a core-linux-systems noob and wondering if I’m doing anything wrong [18:05:12] I’ll have to cleanup the old /tmp by hand, and this is just for the new exec nodes [18:09:43] andrewbogott: ^^ too (if you have time) [18:13:02] grrit-wm1: that's not what i meant :p [18:13:52] mutante: I rebooted it [18:14:09] YuviPanda: ah:) thx [18:14:47] (03CR) 10coren: "This looks okay to me but if we are to encourage tools users to use /tmp then we need to set up cleaning of /tmp" [puppet] - 10https://gerrit.wikimedia.org/r/207157 (https://phabricator.wikimedia.org/T97445) (owner: 10Yuvipanda) [18:16:00] (03PS1) 10Dzahn: enforce-users-groups: remove tungsten references [puppet] - 10https://gerrit.wikimedia.org/r/207159 (https://phabricator.wikimedia.org/T97274) [18:16:38] (03CR) 10John F. Lewis: [C: 031] enforce-users-groups: remove tungsten references [puppet] - 10https://gerrit.wikimedia.org/r/207159 (https://phabricator.wikimedia.org/T97274) (owner: 10Dzahn) [18:21:13] ok I'm gonna deploy the train ... [18:21:42] (03CR) 10Dzahn: [C: 032] "mwprof module is unused. 
godog will make follow-up ticket to remove the puppet code too" [puppet] - 10https://gerrit.wikimedia.org/r/207159 (https://phabricator.wikimedia.org/T97274) (owner: 10Dzahn) [18:22:17] !log upgraded and restarted Eventlogging on hafnium (now at be1e055) [18:22:19] (03PS3) 10Yuvipanda: tools: Create separate /tmp LVM volume for all exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/207157 (https://phabricator.wikimedia.org/T97445) [18:22:23] Logged the message, Master [18:23:30] (03PS1) 1020after4: Group1 wikis to 1.26wmf3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207161 [18:25:11] 6operations, 10ops-eqiad, 5Patch-For-Review: reclaim tungsten as spare - https://phabricator.wikimedia.org/T97274#1242620 (10Dzahn) a:3Cmjohnson @cmjohnson could you edit the spares wiki page and put tungsten back in the pool? i removed from DNS, puppet and icinga and shut it down already. thanks [18:25:48] (03PS1) 10Dereckson: Add *.nasqueron.org to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207162 (https://phabricator.wikimedia.org/T97448) [18:27:28] (03PS1) 10RobH: logstash1004 to trusty [puppet] - 10https://gerrit.wikimedia.org/r/207163 [18:27:33] paravoid: qq, yt? [18:27:40] (03PS2) 10John F. 
Lewis: add symlink wikispecies.org->wikimedia.com [dns] - 10https://gerrit.wikimedia.org/r/207149 (https://phabricator.wikimedia.org/T9495) [18:31:59] !log upgraded and restarted Eventlogging on eventlog1001 (now at be1e055) [18:32:03] Logged the message, Master [18:32:28] (03CR) 1020after4: [C: 032] Group1 wikis to 1.26wmf3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207161 (owner: 1020after4) [18:32:34] 6operations, 10Wikimedia-Mailing-lists: close and delete the flowfunding mailing list - https://phabricator.wikimedia.org/T97328#1242663 (10BBlack) [18:33:33] (03Merged) 10jenkins-bot: Group1 wikis to 1.26wmf3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207161 (owner: 1020after4) [18:33:44] (03CR) 10RobH: [C: 032] logstash1004 to trusty [puppet] - 10https://gerrit.wikimedia.org/r/207163 (owner: 10RobH) [18:34:03] !log twentyafterfour rebuilt wikiversions.cdb and synchronized wikiversions files: Group1 wikis to 1.26wmf3 [18:34:09] Logged the message, Master [18:34:36] (03PS1) 10Ottomata: Set up kafkatee instance on oxygen for ops webrequest log debugging [puppet] - 10https://gerrit.wikimedia.org/r/207166 (https://phabricator.wikimedia.org/T96616) [18:35:12] (03PS2) 10BBlack: RT - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/206977 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [18:35:19] (03CR) 10BBlack: [C: 032] RT - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/206977 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [18:35:25] (03CR) 10BBlack: [V: 032] RT - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/206977 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [18:35:29] (03CR) 10Ottomata: Set up kafkatee instance on oxygen for ops webrequest log debugging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/207166 (https://phabricator.wikimedia.org/T96616) (owner: 10Ottomata) [18:35:33] (03PS2) 10BBlack: donate - 
Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/206979 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [18:35:39] (03CR) 10BBlack: [C: 032 V: 032] donate - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/206979 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [18:35:46] (03PS2) 10BBlack: servermon - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/206982 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [18:35:53] (03CR) 10BBlack: [C: 032 V: 032] servermon - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/206982 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [18:36:03] (03PS2) 10BBlack: annual - Raise HSTS max-age to 1 year and add "always" [puppet] - 10https://gerrit.wikimedia.org/r/206984 (https://phabricator.wikimedia.org/T599) (owner: 10Chmarkine) [18:36:12] (03CR) 10BBlack: [C: 032 V: 032] annual - Raise HSTS max-age to 1 year and add "always" [puppet] - 10https://gerrit.wikimedia.org/r/206984 (https://phabricator.wikimedia.org/T599) (owner: 10Chmarkine) [18:36:17] (03PS4) 10BBlack: iegreview - Raise HSTS max-age to 1 year and add "always" [puppet] - 10https://gerrit.wikimedia.org/r/206983 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [18:36:23] (03CR) 10BBlack: [C: 032 V: 032] iegreview - Raise HSTS max-age to 1 year and add "always" [puppet] - 10https://gerrit.wikimedia.org/r/206983 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [18:36:32] (03PS2) 10BBlack: ishmael - Raise HSTS max-age to 1 year and add "always" [puppet] - 10https://gerrit.wikimedia.org/r/206992 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [18:36:38] (03CR) 10BBlack: [C: 032 V: 032] ishmael - Raise HSTS max-age to 1 year and add "always" [puppet] - 10https://gerrit.wikimedia.org/r/206992 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [18:36:46] (03PS2) 10BBlack: 
doc - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/206980 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [18:36:53] (03CR) 10BBlack: [C: 032 V: 032] doc - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/206980 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [18:37:00] (03PS2) 10BBlack: integration - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/206981 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [18:37:06] (03CR) 10BBlack: [C: 032 V: 032] integration - Raise HSTS max-age to 1 year [puppet] - 10https://gerrit.wikimedia.org/r/206981 (https://phabricator.wikimedia.org/T40516) (owner: 10Chmarkine) [18:46:07] !log force merged User:Js@ruwiki to User:Js@global per global-renamers list [18:46:15] Logged the message, Master [18:47:51] !log stopping puppet on carbon - livehacking partman recipe testing [18:47:56] Logged the message, Master [18:49:55] (03PS1) 10Dereckson: Import sources on ne.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207170 [18:50:09] hrmm, i scheduled a downtime window for the puppet check on carbon, we'll see if it works [18:54:41] (03CR) 10Dzahn: [C: 032] "dzahn@sphinx:~/wmf/dns/templates$ ls -ls | grep species" [dns] - 10https://gerrit.wikimedia.org/r/207149 (https://phabricator.wikimedia.org/T9495) (owner: 10John F. 
Lewis) [19:19:39] PROBLEM - puppetmaster backend https on palladium is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8141: HTTP/1.1 500 Internal Server Error [19:19:45] wut [19:19:49] PROBLEM - puppet last run on mw2049 is CRITICAL puppet fail [19:19:51] PROBLEM - puppetmaster https on palladium is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 8140: HTTP/1.1 500 Internal Server Error [19:19:51] PROBLEM - puppet last run on elastic1025 is CRITICAL Puppet has 25 failures [19:19:51] PROBLEM - puppet last run on mw2158 is CRITICAL Puppet has 18 failures [19:19:57] ok, that again.. sigh [19:20:00] PROBLEM - puppet last run on db2010 is CRITICAL Puppet has 7 failures [19:20:00] PROBLEM - puppet last run on elastic1029 is CRITICAL Puppet has 12 failures [19:20:00] PROBLEM - puppet last run on db2017 is CRITICAL Puppet has 11 failures [19:20:01] PROBLEM - puppet last run on db2056 is CRITICAL Puppet has 12 failures [19:20:06] <^d> That's gonna spam [19:20:08] <^d> Here we go [19:20:10] mod_passenger? [19:20:46] yes [19:21:01] !log restarting apache on palladium [19:21:08] Logged the message, Master [19:21:11] Unexpected error in mod_passenger [19:21:29] (03CR) 10Krinkle: [C: 031] contint: Use device=none in tmpfs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/204542 (owner: 10Krinkle) [19:21:37] (03PS2) 10Krinkle: contint: Use device=none in tmpfs [puppet] - 10https://gerrit.wikimedia.org/r/204542 [19:21:44] !log tmp. stopped icinga-wm because puppetmaster fail spam [19:21:48] Logged the message, Master [19:22:34] !log twentyafterfour Synchronized php-1.26wmf2/thumb.php: (no message) (duration: 00m 33s) [19:22:40] Logged the message, Master [19:24:53] I thought NOLOGMSG=1 sync-file ... would skip the logmsgbot completely [19:25:01] guess not. 
[19:25:25] !log twentyafterfour Synchronized php-1.26wmf3/thumb.php: (no message) (duration: 00m 19s)
[19:25:31] Logged the message, Master
[19:26:25] 10Ops-Access-Requests, 6operations: Access to delegated Gmail accounts for krobinson, mbeattie - https://phabricator.wikimedia.org/T97461#1242948 (10Juro2351) 3NEW a:3atgo
[19:28:19] twentyafterfour, sigh...
[19:29:06] twentyafterfour, it's DOLOGMSGNOLOG=1
[19:29:33] outdated documentation is awesome
[19:29:40] the documentation shows that
[19:29:51] not the documentation I've been working from
[19:30:17] https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys
[19:30:42] https://wikitech.wikimedia.org/wiki/How_to_deploy_code#Creating_a_Security_Patch
[19:31:55] 10Ops-Access-Requests, 6operations: Access to delegated Gmail accounts for krobinson, mbeattie - https://phabricator.wikimedia.org/T97461#1242986 (10Krenair) 5Open>3Invalid Google Apps stuff is administrated by the WMF Office IT team, you'll need to ask them.
[19:32:37] fixed
[19:32:45] (fixed the wikipage)
[19:34:10] !log Deployed patch for T97391
[19:34:13] https://wikitech.wikimedia.org/w/index.php?title=How_to_deploy_code&diff=prev&oldid=141721 - heh, I had to fix something similar only a few months ago
[19:34:18] Logged the message, Master
[19:44:39] 6operations: MaxClients on puppetmaster - https://phabricator.wikimedia.org/T97466#1243042 (10Dzahn) 3NEW
[19:45:34] (03Abandoned) 10Yuvipanda: tools: Ensure that exim-heavy only is on tools-mail [puppet] - 10https://gerrit.wikimedia.org/r/205915 (owner: 10Yuvipanda)
[19:47:52] (03CR) 10Merlijn van Deen: [C: 04-1] "-1'ing to clarify it's not an unreviewed change but a change with open questions" [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) (owner: 10Merlijn van Deen)
[19:48:40] (03PS2) 10Merlijn van Deen: Tools: Let bigbrother ignore empty lines and comments [puppet] - 10https://gerrit.wikimedia.org/r/202363 (https://phabricator.wikimedia.org/T94990) (owner: 10Tim Landscheidt)
[19:48:58] (03CR) 10Merlijn van Deen: [C: 031] Tools: Let bigbrother ignore empty lines and comments [puppet] - 10https://gerrit.wikimedia.org/r/202363 (https://phabricator.wikimedia.org/T94990) (owner: 10Tim Landscheidt)
[19:49:13] re-enabled puppet on neon. restarting icinga-wm
[19:49:40] PROBLEM - puppet last run on neon is CRITICAL puppet fail
[19:49:54] icinga-wm: yes, lol
[19:50:27] (03PS1) 10Hashar: Dummy README.md for labs images creation [puppet] - 10https://gerrit.wikimedia.org/r/207250
[19:52:30] (03CR) 10Hashar: "The README.md files are close to useless but would help figuring out the difference between labs_bootstrapvz and labs_vmbuilder." [puppet] - 10https://gerrit.wikimedia.org/r/207250 (owner: 10Hashar)
[19:53:09] RECOVERY - puppet last run on neon is OK Puppet is currently enabled, last run 16 seconds ago with 0 failures
[19:55:23] (03CR) 10Merlijn van Deen: [C: 04-1] "After tools-redis died ( https://phabricator.wikimedia.org/T96485 ), we concluded overcommit=>1 was probably one of the causes. One of the" [puppet] - 10https://gerrit.wikimedia.org/r/194095 (https://phabricator.wikimedia.org/T91498) (owner: 10Yuvipanda)
[19:56:07] (03CR) 10Yuvipanda: "What, who concluded overcommit => 1 was the cause? It was the *Fix* the last time it died and couldn't come back up." [puppet] - 10https://gerrit.wikimedia.org/r/194095 (https://phabricator.wikimedia.org/T91498) (owner: 10Yuvipanda)
[19:56:15] valhallasw`cloud: ^
[19:56:23] it was the fix. I set it to 1 manually and then puppetized it
[19:57:05] YuviPanda: https://phabricator.wikimedia.org/T96485#1220481
[19:57:52] I thought you *disabled* overcommit
[19:58:08] nope
[19:58:17] valhallasw`cloud: enabled it. redis requires overcommit to work
[19:59:02] well, no, saving a dump needs a few kB of memory available
[19:59:07] not necessarily overcommit
[19:59:15] (03PS1) 10Ori.livneh: add 'sync' alias to my (=ori's) dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/207259
[19:59:33] as long as the main redis process doesn't eat all memory it should be fine, I'd think?
[19:59:37] valhallasw`cloud: read the faq :)
[19:59:55] valhallasw`cloud: it needs overcommit because it forks to bgsave but then the memory in the fork isn’t used because COW and it only reads
[20:00:02] but it needs overcommit otherwise the fork will fail because the process is too big
[20:00:13] errr
[20:00:19] which is what was happening.
[20:00:22] then why did the bgsave work even with overcommit off?
[20:00:23] the fork was failing and bgsave was failing
[20:00:26] it didn't
[20:00:36] it worked before the reboot because I had manually set it the previous time
[20:00:37] then how did we have a 15GB redis database after the reboot?
[20:00:38] and then made that patch
[20:00:47] and it got cleared during reboot
[20:00:49] 6operations, 5Interdatacenter-IPsec, 5Patch-For-Review: Fix ipv6 autoconf issues - https://phabricator.wikimedia.org/T94417#1243096 (10BBlack) So, this issue is really complicated when you get into the details. @Faidon and I have had several irc brainstorming conversations about this over the past months th...
[20:00:52] and then bgsave was erroring out
[20:00:58] and then I re-enabled it by hand and it worked again
[20:01:04] and then I commited a patch just for tools redis
[20:02:03] YuviPanda: again, how did we have ~15GB of keys from 11 apr...20 apr or so if the bgsave failed because of overcommit issues?
[20:02:14] valhallasw`cloud: because overcommit was enabled from 11APR to 20APR
[20:02:19] or until the reboot
[20:02:22] sysctl gets cleared on reboot
[20:02:32] wait, April?
[20:02:41] well, whenever we were investigating it last time
[20:02:41] okay, but wait. then overcommit can still have been the cause of the server lockup
[20:02:47] last time was 20 apr
[20:02:52] which is when the server locked up
[20:02:53] the cause was redis being fed too much data for too long a time
[20:02:58] and it couldn’t keep up with the eviction
[20:03:09] and so the 12G ‘limit’ wasn’t enforced and it died.
[20:04:18] so the cause was having a 12G redis limit and putting too many things in them
[20:04:19] (03CR) 10Andrew Bogott: [C: 032] Dummy README.md for labs images creation [puppet] - 10https://gerrit.wikimedia.org/r/207250 (owner: 10Hashar)
[20:04:19] YuviPanda: ....
[20:04:28] (that’s the current theory, of course)
[20:04:36] either way, overcommit isn’t the problem...
[20:04:51] YuviPanda: if overcommit was off, bgsave should have failed if the memory used by redis was more than half the system memory, no?
[20:05:01] valhallasw`cloud: bgsave *did* fail...
[20:05:05] valhallasw`cloud: ok here’s the timeline
[20:05:08] months ago
[20:05:14] before the apr 20th failure
[20:05:19] when redis died
[20:05:22] I put it back up
[20:05:25] and enabled overcommit manually
[20:05:29] and everything was fine
[20:05:33] ok. then it makes sense
[20:05:34] bgsave worked fine
[20:05:38] and then on apr 20th it died
[20:05:42] and server reboot
[20:05:49] however, that also took down the entire server in the end
[20:05:49] overcommit flag was cleared, and disabled.
[20:06:43] it did. but disabling overcommit isn’t going to give you anything unless you’re also willing to move max_memory to something like 6G or something
[20:06:56] *nod*
[20:06:59] ok, I get it then
[20:07:00] thanks
[20:07:04] :D
[20:08:16] (03CR) 10Merlijn van Deen: "Yay, miscommunication. The question remains, though, how to configure overcommit so it doesn't take down the entire server at some point.." [puppet] - 10https://gerrit.wikimedia.org/r/194095 (https://phabricator.wikimedia.org/T91498) (owner: 10Yuvipanda)
[20:08:37] valhallasw`cloud: so now we have to basically rebuild almost all of toollabs...
[20:08:43] well, instances at least
[20:08:52] YuviPanda: or qemu hax
[20:08:57] YuviPanda: or other replication hax
[20:09:11] I think that ship might have sailed for the ones already replicated
[20:09:13] err
[20:09:15] moved
[20:09:48] it's also possible to shrink afterwards, but I don't think that can easily be done live
[20:10:16] at least, it's not clear from the qemu docs to me, but andrewbogott is trying to get some clarification from the qemu devs
[20:10:31] and there's also no easy way to replicate an online block device in linux
[20:11:11] I’m confident that they can’t be recompressed while running. That would be too magical.
[20:11:27] aaaah cool :D
[20:11:41] (03PS2) 10Ori.livneh: add 'sync' alias to my (=ori's) dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/207259
[20:11:54] (03CR) 10Ori.livneh: [C: 032 V: 032] add 'sync' alias to my (=ori's) dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/207259 (owner: 10Ori.livneh)
[20:12:10] (03PS3) 10Dzahn: policy.wm.org: minimal module/role for microsite [puppet] - 10https://gerrit.wikimedia.org/r/206978 (https://phabricator.wikimedia.org/T97329)
[20:12:38] (03PS2) 10Dereckson: Import sources on ne.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207170 (https://phabricator.wikimedia.org/T97396)
[20:13:35] (03PS4) 10Dzahn: policy.wm.org: minimal module/role for microsite [puppet] - 10https://gerrit.wikimedia.org/r/206978 (https://phabricator.wikimedia.org/T97329)
[20:14:23] (03CR) 10Dzahn: [C: 032] policy.wm.org: minimal module/role for microsite [puppet] - 10https://gerrit.wikimedia.org/r/206978 (https://phabricator.wikimedia.org/T97329) (owner: 10Dzahn)
[20:15:44] (03CR) 10Dzahn: [C: 032] varnish: add misc-web config for policy.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/206974 (https://phabricator.wikimedia.org/T97329) (owner: 10Dzahn)
[20:16:35] andrewbogott: there's a few talks from 2012-ish that suggest it should be possible, but I don't know qemu well enough to really understand it.
[20:16:41] e.g. http://www.linux-kvm.org/wiki/images/c/cf/2011-forum-qemu_live_block_copy_submit.pdf and https://events.linuxfoundation.org/images/stories/pdf/lcjp2012_bonzini.pdf
[20:17:13] (03PS2) 10Dzahn: varnish: add misc-web config for policy.wm.org [puppet] - 10https://gerrit.wikimedia.org/r/206974 (https://phabricator.wikimedia.org/T97329)
[20:17:46] but I'm not sure if that actually gives you a small file again in the end
[20:33:35] (03PS1) 10Dzahn: add policy microsite on zirconium [puppet] - 10https://gerrit.wikimedia.org/r/207271 (https://phabricator.wikimedia.org/T97329)
[20:34:50] (03PS2) 10Dzahn: add policy microsite on zirconium [puppet] - 10https://gerrit.wikimedia.org/r/207271 (https://phabricator.wikimedia.org/T97329)
[20:35:34] (03CR) 10John F. Lewis: [C: 031] "because zirconium can never have enough" [puppet] - 10https://gerrit.wikimedia.org/r/207271 (https://phabricator.wikimedia.org/T97329) (owner: 10Dzahn)
[20:38:28] (03CR) 10Dzahn: [C: 032] "tomorrow you'll ask for a ganeti vm for it, heh" [puppet] - 10https://gerrit.wikimedia.org/r/207271 (https://phabricator.wikimedia.org/T97329) (owner: 10Dzahn)
[20:38:57] (03CR) 10John F. Lewis: "+1 to that comment" [puppet] - 10https://gerrit.wikimedia.org/r/207271 (https://phabricator.wikimedia.org/T97329) (owner: 10Dzahn)
[20:39:56] (03PS1) 10Dereckson: Enable local uploads for sysop group on mai.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207273 (https://phabricator.wikimedia.org/T97397)
[20:45:57] (03CR) 10Dzahn: [C: 032] add policy.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/206972 (https://phabricator.wikimedia.org/T97329) (owner: 10Dzahn)
[20:46:37] 6operations, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Isolation, 7Nodepool: Create a Debian package for NodePool on Debian Jessie - https://phabricator.wikimedia.org/T89142#1243251 (10hashar) 5Open>3Resolved a:3hashar The preliminary work is completed. The package will be enha...
[20:46:52] (03CR) 10Dzahn: [C: 032] delete contacts.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/206415 (https://phabricator.wikimedia.org/T90679) (owner: 10Dzahn)
[20:53:38] 6operations, 10Continuous-Integration-Infrastructure, 5Continuous-Integration-Isolation, 7Nodepool: Create a Debian package for NodePool on Debian Jessie - https://phabricator.wikimedia.org/T89142#1243271 (10hashar) I have poked the [[ https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=781027 | Debian ITP ]...
[20:55:46] Is there a way to purge all memc keys with a particular prefix?
[20:57:56] !log anomie Synchronized php-1.26wmf3/includes/media/FormatMetadata.php: Unbreak API imageinfo with extmetadata (mainly on Commons) (duration: 00m 25s)
[20:58:26] Logged the message, Master
[21:00:04] rmoen, kaldari: Dear anthropoid, the time has come. Please deploy Mobile Web (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150428T2100).
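The tools-redis thread above comes down to one kernel knob: Redis forks for BGSAVE, the child shares pages copy-on-write and mostly only reads, but under the default heuristic (`vm.overcommit_memory=0`) the kernel can refuse a fork whose nominal size exceeds what it thinks is available, so the dump fails. A plain `sysctl -w` lasts only until reboot, which is exactly what bit tools-redis. A minimal sketch of a persistent fix follows; this is generic illustration, not the actual Wikimedia puppet change (r/194095), and the file path assumes a stock Debian/Ubuntu `sysctl.d` layout:

```shell
# One-off setting -- cleared on reboot, which is what happened to tools-redis:
sysctl -w vm.overcommit_memory=1

# Persist it so a reboot does not silently revert the fix
# (hypothetical fragment name; any sysctl.d file works):
cat > /etc/sysctl.d/60-redis-overcommit.conf <<'EOF'
# Redis BGSAVE forks; COW means the child needs little real memory,
# but overcommit_memory=0 can reject the fork outright.
vm.overcommit_memory = 1
EOF
sysctl --system   # reload all sysctl.d fragments
```

As the discussion notes, overcommit only lets the fork succeed; it does nothing about maxmemory being set too close to physical RAM.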
[21:04:31] doing Gather deployment with Rob
[21:11:05] (03PS2) 10BBlack: varnish: set do_gzip to true for text [puppet] - 10https://gerrit.wikimedia.org/r/207013 (owner: 10Ori.livneh)
[21:11:13] (03CR) 10BBlack: [C: 032 V: 032] varnish: set do_gzip to true for text [puppet] - 10https://gerrit.wikimedia.org/r/207013 (owner: 10Ori.livneh)
[21:14:33] 6operations, 10Wikimedia-DNS, 5Patch-For-Review: Set up new URL policy.wikimedia.org - https://phabricator.wikimedia.org/T97329#1243365 (10Dzahn) @Yana see this now: https://policy.wikimedia.org/ :) Let us know once you have content for it. Should we keep this ticket open until we uploaded the actual...
[21:18:22] (03CR) 10Yuvipanda: Tools: Simplify and fix mail setup (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/205914 (https://phabricator.wikimedia.org/T74867) (owner: 10Merlijn van Deen)
[21:20:12] (03PS1) 10Anomie: Hook 'ValidateExtendedMetadataCache' for T97469 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207279 (https://phabricator.wikimedia.org/T97469)
[21:22:57] (03CR) 10Legoktm: [C: 031] Hook 'ValidateExtendedMetadataCache' for T97469 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207279 (https://phabricator.wikimedia.org/T97469) (owner: 10Anomie)
[21:23:43] (03PS2) 10Anomie: Hook 'ValidateExtendedMetadataCache' for T97469 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207279 (https://phabricator.wikimedia.org/T97469)
[21:25:01] (03CR) 10Chad: [C: 032] Hook 'ValidateExtendedMetadataCache' for T97469 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207279 (https://phabricator.wikimedia.org/T97469) (owner: 10Anomie)
[21:26:16] (03PS1) 10Dzahn: contacts: remove role from node and delete it [puppet] - 10https://gerrit.wikimedia.org/r/207280 (https://phabricator.wikimedia.org/T90679)
[21:26:22] kaldari: Mind if I deploy a change to CommonSettings.php quick?
[21:27:20] (03CR) 10Dzahn: [C: 032] contacts: remove role from node and delete it [puppet] - 10https://gerrit.wikimedia.org/r/207280 (https://phabricator.wikimedia.org/T90679) (owner: 10Dzahn)
[21:30:30] !log rmoen Synchronized php-1.26wmf3/extensions/Gather/: Updating gather (duration: 00m 44s)
[21:30:38] Logged the message, Master
[21:30:53] (03Merged) 10jenkins-bot: Hook 'ValidateExtendedMetadataCache' for T97469 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207279 (https://phabricator.wikimedia.org/T97469) (owner: 10Anomie)
[21:32:14] !log demon Synchronized wmf-config/CommonSettings.php: expire old metadata cache entries (duration: 00m 26s)
[21:32:19] Logged the message, Master
[21:32:21] <^d> anomie: ^^^
[21:32:26] ^d: Yay, it worked
[21:34:44] (03PS1) 10Dzahn: backup: remove fileset for contacts again [puppet] - 10https://gerrit.wikimedia.org/r/207281 (https://phabricator.wikimedia.org/T90679)
[21:35:33] (03CR) 10Yuvipanda: [C: 04-2] "Nope, this installs the whole world on it. Let's figure out what things we want and just get those." [puppet] - 10https://gerrit.wikimedia.org/r/204100 (owner: 10Merlijn van Deen)
[21:35:51] (03PS2) 10Yuvipanda: Extend Exim diamond collector for Tool Labs [puppet] - 10https://gerrit.wikimedia.org/r/207043 (https://phabricator.wikimedia.org/T96898) (owner: 10Merlijn van Deen)
[21:36:08] (03PS3) 10Yuvipanda: tools: Extended Exim diamond collector for Tool Labs [puppet] - 10https://gerrit.wikimedia.org/r/207043 (https://phabricator.wikimedia.org/T96898) (owner: 10Merlijn van Deen)
[21:36:19] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Extended Exim diamond collector for Tool Labs [puppet] - 10https://gerrit.wikimedia.org/r/207043 (https://phabricator.wikimedia.org/T96898) (owner: 10Merlijn van Deen)
[21:38:39] (03PS2) 10Yuvipanda: dsh: remove template from scap-proxies and just use join() [puppet] - 10https://gerrit.wikimedia.org/r/206132 (owner: 10Chad)
[21:38:48] (03CR) 10Yuvipanda: [C: 032 V: 032] dsh: remove template from scap-proxies and just use join() [puppet] - 10https://gerrit.wikimedia.org/r/206132 (owner: 10Chad)
[21:41:00] (03CR) 10Dzahn: mediawiki: Add test to verify redirects.conf has been regenerated from redirects.dat (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/204994 (https://phabricator.wikimedia.org/T72068) (owner: 10Legoktm)
[21:43:59] (03CR) 10Dzahn: [C: 032] Tools: Fix bigbrother's patterns for web service types [puppet] - 10https://gerrit.wikimedia.org/r/201996 (https://phabricator.wikimedia.org/T94496) (owner: 10Tim Landscheidt)
[21:45:15] mutante: :) thanks!
[21:45:35] (03PS2) 10Dereckson: Restrict local uploads on mai.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207273 (https://phabricator.wikimedia.org/T97397)
[21:46:32] YuviPanda: :)
[21:46:44] (03PS2) 10Merlijn van Deen: Include python-virtualenv on redis hosts [puppet] - 10https://gerrit.wikimedia.org/r/204100
[21:46:48] mutante: I hadn’t merged it because I’m going to rip out all of that code shortly
[21:47:01] (03PS3) 10Yuvipanda: Include python-virtualenv on redis hosts [puppet] - 10https://gerrit.wikimedia.org/r/204100 (owner: 10Merlijn van Deen)
[21:47:33] (03CR) 10Yuvipanda: [C: 04-1] Include python-virtualenv on redis hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/204100 (owner: 10Merlijn van Deen)
[21:47:41] (03CR) 10jenkins-bot: [V: 04-1] Include python-virtualenv on redis hosts [puppet] - 10https://gerrit.wikimedia.org/r/204100 (owner: 10Merlijn van Deen)
[21:47:55] YES FASTER THAN JENKINS WOO
[21:49:50] YuviPanda: have I mentioned I hate puppet?
[21:50:00] let's have different syntaxes for the exact same thing!
[21:50:00] valhallasw`cloud: *hug*
[21:51:31] (03PS4) 10Merlijn van Deen: Include python-virtualenv on redis hosts [puppet] - 10https://gerrit.wikimedia.org/r/204100
[21:51:47] (03PS3) 10Dereckson: Restrict local uploads on mai.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207273 (https://phabricator.wikimedia.org/T97397)
[21:51:59] PROBLEM - Check status of defined EventLogging jobs on eventlog1001 is CRITICAL Stopped EventLogging jobs: reporter/statsd consumer/server-side-events-log consumer/mysql-m4-master consumer/client-side-events-log consumer/client-side-events-kafka-log consumer/all-events-log multiplexer/all-events processor/server-side-events processor/client-side-events-kafka processor/client-side-events forwarder/8422 forwarder/8421
[21:52:46] (03PS5) 10Yuvipanda: Include python-virtualenv on redis hosts [puppet] - 10https://gerrit.wikimedia.org/r/204100 (owner: 10Merlijn van Deen)
[21:52:57] (03CR) 10Yuvipanda: [C: 032] Include python-virtualenv on redis hosts [puppet] - 10https://gerrit.wikimedia.org/r/204100 (owner: 10Merlijn van Deen)
[21:53:13] valhallasw`cloud: eventually we should have a toollabs::admintools and have that be included everywhere, I think
[21:54:29] (03CR) 10Dereckson: [C: 031] Set $wgRateLimits['badcaptcha'] to counter bots [mediawiki-config] - 10https://gerrit.wikimedia.org/r/195886 (https://phabricator.wikimedia.org/T92376) (owner: 10Nemo bis)
[21:59:08] (03PS4) 10Dereckson: Prevent new wikis from using Graph: namespace [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206776
[22:01:24] (03CR) 10Dereckson: "PS4: prevent labs. and outreach. too, as there hasn't been anything on the concerned namespaces yet." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206776 (owner: 10Dereckson)
[22:01:41] (03PS3) 10Dereckson: Enable Graph extension on sv.wikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206777 (https://phabricator.wikimedia.org/T97027)
[22:04:10] PROBLEM - puppet last run on mw2106 is CRITICAL puppet fail
[22:05:54] !log tstarling Synchronized php-1.26wmf2/extensions/SecurePoll: for new voterList.php (duration: 00m 23s)
[22:06:04] Logged the message, Master
[22:06:41] anomie, hi! here's marcel from analytics. I don't know if you are the one to contact, but this url: http://meta.wikimedia.org/w/api.php?action=jsonschema&title=MobileWikiAppArticleSuggestions&revid=11448426 is returning strange empty values for the required fields, and I thought that it would maybe have something to do with: https://gerrit.wikimedia.org/r/#/c/207274/1 . Do you have any idea?
[22:07:27] !log running bv2015/voterList.php on terbium
[22:07:32] Logged the message, Master
[22:11:12] (03PS4) 10Yuvipanda: tools: Create separate /tmp LVM volume for all exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/207157 (https://phabricator.wikimedia.org/T97445)
[22:12:59] did you know that we have a user called ","?
[22:13:17] lol
[22:13:22] I did know we had a user called "0"
[22:13:31] and that that broke one of legoktm 's SULF scripts
[22:13:43] just checking the output of voterList.php, and it starts with the top lexically sorted names
[22:13:48] (03PS3) 10Dzahn: Removed scs-c8-codfw from DNS mgmt files [dns] - 10https://gerrit.wikimedia.org/r/206157 (https://phabricator.wikimedia.org/T84737) (owner: 10Papaul)
[22:14:02] there is a lot of stupid crap
[22:14:59] "," has 673 edits in the long period and 59 edits in the short period btw
[22:15:02] (03CR) 10Dzahn: [C: 032] Removed scs-c8-codfw from DNS mgmt files [dns] - 10https://gerrit.wikimedia.org/r/206157 (https://phabricator.wikimedia.org/T84737) (owner: 10Papaul)
[22:15:44] there is also "--" and "-- -- --"
[22:18:30] TimStarling: there's also a user named Special:Userlogout :P
[22:18:38] (https://phabricator.wikimedia.org/T5507)
[22:21:20] (03PS2) 10Dzahn: site.pp: add labcontrol1001 [puppet] - 10https://gerrit.wikimedia.org/r/206486 (https://phabricator.wikimedia.org/T96048)
[22:22:29] RECOVERY - puppet last run on mw2106 is OK Puppet is currently enabled, last run 18 seconds ago with 0 failures
[22:22:45] (03CR) 10Dzahn: [C: 032] site.pp: add labcontrol1001 [puppet] - 10https://gerrit.wikimedia.org/r/206486 (https://phabricator.wikimedia.org/T96048) (owner: 10Dzahn)
[22:25:11] thanks mutante
[22:25:17] do we have a User:Null ?
[22:25:23] Yes
[22:25:26] also User:0
[22:25:26] heh
[22:25:30] and User: (empty string)
[22:25:37] :o
[22:25:41] Although that one is broken beyond what's reasonable
[22:35:19] * subbu is amused with the user names
[22:36:09] PROBLEM - puppet last run on ganeti2001 is CRITICAL puppet fail
[22:39:49] (03PS1) 10coren: Make labs_lvm::volume understand swap fstype [puppet] - 10https://gerrit.wikimedia.org/r/207296
[22:39:52] YuviPanda: ^^
[22:40:32] Coren: do all the options to mkfs work with mkswap too?
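The voterList.php observation above ("it starts with the top lexically sorted names") is plain byte-order sorting: in the C locale, punctuation such as "," (0x2C) and "-" (0x2D) collates before digits and letters, which is why those usernames head the list. A quick illustrative sketch follows; the usernames are the ones from the discussion, but the pipeline itself is hypothetical and not how voterList.php works internally:

```shell
# Byte-wise (C locale) sort puts "," and "--" ahead of "0" and letters,
# since ASCII ',' (44) < '-' (45) < '0' (48) < 'A' (65).
printf '%s\n' 'Alice' '0' '--' ',' 'Null' | LC_ALL=C sort
# -> , then --, 0, Alice, Null
```

Under a non-C locale the collation can differ, which is one more reason such lists look surprising at first glance.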
[22:41:06] YuviPanda: Most will not - just make sure that when you specify fstype to 'swap' you only specify useful ones. :-)
[22:41:15] hoo: 21 Apr 2015 Maintenance script (talk | contribs) renamed user "Hoo" (with 6 edits) to "Hoo~dewiki" (SUL finalization)
[22:41:33] YuviPanda: The same applies in re ext4 vs xfs, etc.
[22:41:36] hmm, that seems a bit of a nightmareish...
[22:41:38] Coren: true
[22:41:44] Coren: can you put a comment there to that effect?
[22:41:55] I guess that’s just generally something you should be wary of...
[22:42:00] mutante: Yeah, I own the global account Hoo, but I'm not actively using it
[22:42:16] I had to cheat in order to get it, though
[22:42:25] cheat = paid springle?
[22:42:37] YuviPanda: Where? That if you specify mkfs_opt parameter you need to make them reasonable for the filesystem you are creating?
[22:42:45] (the default is no options)
[22:42:54] No, I renamed an account with a lot of edits of mine, so that I could claim the global one, then I renamed it back to Hoo man
[22:42:59] on mediawiki.org, I think
[22:43:00] hoo: i was surprised mine was actually unified :)
[22:43:05] that was before I got shell
[22:43:10] hoo: gotcha:)
[22:43:50] (03CR) 10Yuvipanda: [C: 031] Make labs_lvm::volume understand swap fstype [puppet] - 10https://gerrit.wikimedia.org/r/207296 (owner: 10coren)
[22:43:55] Coren: fair enough ^ +1’d
[22:44:29] (03CR) 10coren: [C: 032] "This just adds a new fs type, and is noop for those not using it." [puppet] - 10https://gerrit.wikimedia.org/r/207296 (owner: 10coren)
[22:47:55] (03PS5) 10Yuvipanda: tools: Create separate /tmp LVM volume for all exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/207157 (https://phabricator.wikimedia.org/T97445)
[22:48:04] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Create separate /tmp LVM volume for all exec nodes [puppet] - 10https://gerrit.wikimedia.org/r/207157 (https://phabricator.wikimedia.org/T97445) (owner: 10Yuvipanda)
[22:52:18] (03PS1) 10RobH: new logstash partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/207300
[22:52:41] RECOVERY - puppet last run on ganeti2001 is OK Puppet is currently enabled, last run 10 seconds ago with 0 failures
[22:54:15] (03PS1) 10Yuvipanda: tools: Have node::compute::general inherit from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/207302
[22:54:32] (03PS2) 10Yuvipanda: tools: Have node::compute::general inherit from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/207302
[22:54:48] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Have node::compute::general inherit from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/207302 (owner: 10Yuvipanda)
[22:55:42] (03CR) 10RobH: [C: 032] new logstash partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/207300 (owner: 10RobH)
[22:55:44] greg-g: there's a UBN eventlogging issue due to the API changes in wmf3, I'm backporting a patch for it now and will sneak it in before the swat window starts...
[22:56:17]
[22:56:17] µ
[22:56:20] pppppp§µ
[22:56:40] did someone just merge my change on palldium?
[22:57:01] fine if so, just wanted to make sure im not losing my mind...
[22:57:37] YuviPanda: you merge my stuff? (i just see you +2ing stuff is all)
[22:58:15] robh: yeah, I think yours got caught in the puppet merge
[22:58:20] ok, cool
[22:58:29] PROBLEM - puppet last run on carbon is CRITICAL Puppet last ran 4 hours ago
[22:58:30] !log legoktm Synchronized php-1.26wmf3/extensions/EventLogging/includes/ApiJsonSchema.php: https://gerrit.wikimedia.org/r/#/c/207297/ (duration: 00m 15s)
[22:58:34] (03PS1) 10Yuvipanda: gridengine: Include base class in exec_host [puppet] - 10https://gerrit.wikimedia.org/r/207304
[22:58:37] Logged the message, Master
[22:58:53] i just spent my brain power on partman, i didnt wanna have an odd git issue too ;D
[23:00:04] RoanKattouw, ^d, bd808, Dereckson, gwicke, James_F: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150428T2300).
[23:00:07] legoktm: kk
[23:00:11] RECOVERY - puppet last run on carbon is OK Puppet is currently enabled, last run 1 minute ago with 0 failures
[23:00:14] * James_F waves.
[23:00:48] * RoanKattouw has a meeting and can't do SWAT today
[23:00:51] andrewbogott_afk: welcome. i am adding it to puppet with standard, just signed cert and initial run and stuff. you can take it from there and just add a role later
[23:01:20] (03PS1) 10Hoo man: Do overrides for test wikis in Wikibase-production.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207305
[23:01:28] Is anyone swatting?
[23:01:33] Have a good meeting RoanKattouw.
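Coren's labs_lvm change above teaches the volume define to run mkswap instead of mkfs, with the caveat that mkfs-style options have no mkswap counterpart. Roughly, the manual steps such a swap volume boils down to are sketched below; the volume group name `vd`, the LV name, and the size are invented for illustration and are not taken from the actual puppet code:

```shell
# Hypothetical manual equivalent of a labs_lvm volume with fstype 'swap':
lvcreate -L 4G -n swap vd            # carve a logical volume from the 'vd' VG
mkswap /dev/vd/swap                  # format as swap -- mkfs options don't apply here
swapon /dev/vd/swap                  # enable immediately
echo '/dev/vd/swap none swap sw 0 0' >> /etc/fstab   # persist across reboots
```

This is why the review warns about passing only "useful" options when fstype is 'swap': anything meant for mkfs.ext4 or mkfs.xfs would simply be wrong for mkswap.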
I'd like to push the above change
[23:01:37] (03PS2) 10Yuvipanda: gridengine: Include base class in exec_host [puppet] - 10https://gerrit.wikimedia.org/r/207304
[23:01:40] no-op for production
[23:01:44] (03CR) 10Yuvipanda: [C: 032 V: 032] gridengine: Include base class in exec_host [puppet] - 10https://gerrit.wikimedia.org/r/207304 (owner: 10Yuvipanda)
[23:02:01] hoo: SWAT hasn't been started yet
[23:02:07] robh: great, because I seem to be running into odd git issues today
[23:02:14] Dereckson: Ok
[23:02:16] aude: here?
[23:02:24] YuviPanda: i read that as 'ive used git today' ;D
[23:02:40] (03PS1) 10coren: Labs: support explicit labs_lvm::swap class [puppet] - 10https://gerrit.wikimedia.org/r/207306
[23:02:45] YuviPanda: ^^
[23:03:07] (03CR) 10Tim Landscheidt: "Why? It is already included via role::labs::tools::*, and in this way I am not sure if $gridmaster is set or causes Puppet failures." [puppet] - 10https://gerrit.wikimedia.org/r/207304 (owner: 10Yuvipanda)
[23:03:20] RECOVERY - Check status of defined EventLogging jobs on eventlog1001 is OK All defined EventLogging jobs are runnning.
[23:03:26] So with Roan out, swat is either ^d or one of us other deployers with patches in
[23:03:51] (03CR) 10Hoo man: [C: 032] "No-op for production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207305 (owner: 10Hoo man)
[23:04:43] bd808: ^d is idle 40 minutes, so…
[23:04:52] (03CR) 10Yuvipanda: "So right now puppet is disabled on all current exec hosts and I'm basically hacking my way through a saner structure for the new ones that" [puppet] - 10https://gerrit.wikimedia.org/r/207304 (owner: 10Yuvipanda)
[23:05:12] bd808: Could you?
[23:05:18] k. I can do it but I need to switch laptops
[23:05:26] * bd808 will be right back
[23:05:33] (03Merged) 10jenkins-bot: Do overrides for test wikis in Wikibase-production.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207305 (owner: 10Hoo man)
[23:06:40] !log hoo Synchronized wmf-config/: Do Wikibase setting overrides for test wikis in Wikibase-production.php (duration: 00m 24s)
[23:06:46] Logged the message, Master
[23:06:52] matt_flaschen: that fixed the DB error
[23:07:10] hoo, thanks.
[23:08:14] robh: ok, so puppet merge is fucked
[23:08:31] ?
[23:08:39] Dereckson: ready to test stuff?
[23:08:41] (03PS3) 10Yuvipanda: tools: Have node::compute::general inherit from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/207302
[23:08:45] my change hit =]
[23:08:46] robh: oh, no, actually, I just didn’t hit submit.
[23:08:49] robh: I’m an idiot so ignore
[23:08:57] (03CR) 10Yuvipanda: [V: 032] tools: Have node::compute::general inherit from toollabs [puppet] - 10https://gerrit.wikimedia.org/r/207302 (owner: 10Yuvipanda)
[23:09:06] (03PS3) 10Yuvipanda: gridengine: Include base class in exec_host [puppet] - 10https://gerrit.wikimedia.org/r/207304
[23:09:12] considering the partman stuff i had, im not about to judge anyone
[23:09:13] (03CR) 10Yuvipanda: [V: 032] gridengine: Include base class in exec_host [puppet] - 10https://gerrit.wikimedia.org/r/207304 (owner: 10Yuvipanda)
[23:09:20] i had an issue i fought with and finally bitched about it
[23:09:32] and as i typed the complaint and hit enter, i figured it out.
[23:09:36] bd808|deploy: in a few minutes
[23:09:55] Confirmed, Beta's working now.
[23:10:07] robh: it's https://en.wikipedia.org/wiki/Rubber_duck_debugging
[23:10:20] James_F: how about you?
[23:10:30] bd808|deploy: I'm ready.
[23:10:41] first up then is the VE bump
[23:10:45] bd808|deploy: Thanks!
[23:11:32] whoa, bd808|deploy's doing swat? :)
[23:11:47] I'm not a PM any more!!
[23:11:57] mutante: yea... irc is the duck
[23:12:31] bd808|deploy: hooray, so you have time again to swat?:)
[23:13:08] Well I do when I have patches in the window and nobody else to trick into doing it ;)
[23:13:17] (03PS3) 10BBlack: de-dupe /static hashing for text/mobile [puppet] - 10https://gerrit.wikimedia.org/r/206878 (https://phabricator.wikimedia.org/T95448)
[23:13:28] James_F: please tell me this doesn't need a full scap
[23:13:46] bd808|deploy: It definitely doesn't.
[23:13:58] * bd808|deploy wipes sweat from brow
[23:14:31] bd808|deploy: It's a SWAT. I'd have had the courtesy to mention it, at least (ideally, I'd know that scaps aren't meant to happen in SWATs). :-)
[23:14:53] (03CR) 10BBlack: [C: 032] "Should be a no-op against current prod traffic (confirmed via varnishlog) due to lack of /static requests." [puppet] - 10https://gerrit.wikimedia.org/r/206878 (https://phabricator.wikimedia.org/T95448) (owner: 10BBlack)
[23:15:03] Not all are so kind James_F
[23:15:09] bd808|deploy: Too true. :-(
[23:15:28] and hell how many people know when you really need one?
[23:15:35] the commit is only touching extensions/VisualEditor... it definitely doesn't need a full scap
[23:15:54] bd808|deploy: It's only needed when you touch i18n, right?
[23:16:03] yeah.
[23:16:11] (Or ResourceLoader's internal gubbins, but if you're doing that in a SWAT you're doomed.)
[23:17:13] Where do I log to beta's SAL?
[23:17:18] (03PS1) 10Yuvipanda: tools: Move compute node code into tools::compute [puppet] - 10https://gerrit.wikimedia.org/r/207310
[23:17:24] (03CR) 10jenkins-bot: [V: 04-1] tools: Move compute node code into tools::compute [puppet] - 10https://gerrit.wikimedia.org/r/207310 (owner: 10Yuvipanda)
[23:17:28] hoo: #wikimedia-releng?
[23:17:28] (03PS2) 10Yuvipanda: tools: Move compute node code into tools::compute [puppet] - 10https://gerrit.wikimedia.org/r/207310
[23:17:30] hoo, -releng probably?
[23:17:34] hoo: !log from #wikimedia-releng
[23:17:36] (03CR) 10Yuvipanda: [C: 032 V: 032] tools: Move compute node code into tools::compute [puppet] - 10https://gerrit.wikimedia.org/r/207310 (owner: 10Yuvipanda)
[23:17:41] Thanks
[23:17:51] Three people == must be right, even if two of us were unsure.
[23:18:20] It goes here -- https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[23:19:06] the slowness of mediawiki-phpunit-zend is astouding
[23:19:19] a single global SAL would be useful for debugging ?
[23:19:29] mutante: Not really.
[23:19:40] mutante: Non-prod SAL noise is bad when debugging real issues.
[23:19:53] (I'd have thought. Wiser others may disagree.)
[23:20:11] a much better SAL than what we do today might be. With search and filters and cool stuff like that
[23:20:23] bd808|deploy++
[23:20:29] wikis are great, except for when they aren't
[23:20:53] PROBLEM - puppet last run on cp4010 is CRITICAL Puppet has 2 failures
[23:21:23] PROBLEM - puppet last run on cp4016 is CRITICAL Puppet has 2 failures
[23:21:46] what's the problem with search on wiki? i just searched for a random line from that SAL on wikitech and it found the page
[23:21:54] PROBLEM - puppet last run on cp1065 is CRITICAL Puppet has 2 failures
[23:22:06] bd808|deploy: That's just crazy talk. ;-)
[23:22:33] PROBLEM - puppet last run on cp3007 is CRITICAL Puppet has 2 failures
[23:22:36] well that's no good
[23:22:44] PROBLEM - puppet last run on cp1066 is CRITICAL Puppet has 2 failures
[23:22:53] PROBLEM - puppet last run on cp4017 is CRITICAL Puppet has 2 failures
[23:23:21] oh VCL, you're my favorite language ever
[23:23:44] PROBLEM - puppet last run on cp3030 is CRITICAL Puppet has 2 failures
[23:24:36] there will be more of those, they're not hurting anything ^
[23:24:36] bd808|deploy: Sync so I can test?
[23:24:56] James_F: almost there. My submod update muscles are rusty
[23:25:08] bd808|deploy: Ha. Fun, isn't it? Sorry.
[23:25:32] (03PS1) 10BBlack: bugfix for f39a6912 [puppet] - 10https://gerrit.wikimedia.org/r/207311 [23:25:33] VE lives to confuse. [23:25:44] PROBLEM - puppet last run on cp1059 is CRITICAL Puppet has 2 failures [23:25:44] PROBLEM - puppet last run on cp3013 is CRITICAL Puppet has 2 failures [23:25:45] (03CR) 10BBlack: [C: 032 V: 032] bugfix for f39a6912 [puppet] - 10https://gerrit.wikimedia.org/r/207311 (owner: 10BBlack) [23:26:03] PROBLEM - puppet last run on cp3015 is CRITICAL Puppet has 2 failures [23:26:07] !log bd808 Synchronized php-1.26wmf3/extensions/VisualEditor: Update VisualEditor for two icon issues {{gerrit|207299}} (duration: 00m 27s) [23:26:14] Logged the message, Master [23:26:24] James_F: ^ test away [23:26:34] bd808|deploy: Thanks! [23:26:47] * Dereckson is there. [23:26:53] PROBLEM - puppet last run on cp1047 is CRITICAL Puppet has 2 failures [23:27:11] Dereckson: cool. you'll be up next [23:27:28] Dereckson: do you want these one at a time or in a big pile? [23:27:54] it's a logfile. there's a search box, it finds stuff in the log. why it needs the "wiki sucks" template i don't know. [23:28:11] bd808|deploy: Confirmed working. Thanks! [23:28:13] a big pile will be fine, my test plan is ready. [23:28:34] RECOVERY - puppet last run on cp4010 is OK Puppet is currently enabled, last run 23 seconds ago with 0 failures [23:28:39] Dereckson: perfect. 
here we go then [23:28:49] (03PS2) 10BryanDavis: Add *.nasqueron.org to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207162 (https://phabricator.wikimedia.org/T97448) (owner: 10Dereckson) [23:28:58] (03CR) 10BryanDavis: [C: 032] Add *.nasqueron.org to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207162 (https://phabricator.wikimedia.org/T97448) (owner: 10Dereckson) [23:29:04] (03Merged) 10jenkins-bot: Add *.nasqueron.org to wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207162 (https://phabricator.wikimedia.org/T97448) (owner: 10Dereckson) [23:29:11] (03PS2) 10BryanDavis: Import sources configuration on mr.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206731 (https://phabricator.wikimedia.org/T96807) (owner: 10Dereckson) [23:29:14] PROBLEM - puppet last run on cp1055 is CRITICAL Puppet has 2 failures [23:29:20] (03CR) 10BryanDavis: [C: 032] Import sources configuration on mr.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206731 (https://phabricator.wikimedia.org/T96807) (owner: 10Dereckson) [23:29:24] PROBLEM - puppet last run on cp3012 is CRITICAL Puppet has 2 failures [23:29:24] PROBLEM - puppet last run on cp3040 is CRITICAL Puppet has 2 failures [23:29:24] PROBLEM - puppet last run on cp3016 is CRITICAL Puppet has 2 failures [23:29:24] PROBLEM - puppet last run on cp3014 is CRITICAL Puppet has 2 failures [23:29:27] (03Merged) 10jenkins-bot: Import sources configuration on mr.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/206731 (https://phabricator.wikimedia.org/T96807) (owner: 10Dereckson) [23:29:36] (03PS4) 10BryanDavis: Set up Babel categories for hu.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203783 (https://phabricator.wikimedia.org/T94842) (owner: 10Dereckson) [23:29:42] (03CR) 10BryanDavis: [C: 032] Set up Babel categories for hu.wikiquote [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/203783 (https://phabricator.wikimedia.org/T94842) (owner: 10Dereckson) [23:29:48] (03Merged) 10jenkins-bot: Set up Babel categories for hu.wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/203783 (https://phabricator.wikimedia.org/T94842) (owner: 10Dereckson) [23:30:03] (03PS4) 10BryanDavis: Restrict local uploads on mai.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207273 (https://phabricator.wikimedia.org/T97397) (owner: 10Dereckson) [23:30:03] PROBLEM - puppet last run on cp3010 is CRITICAL Puppet has 2 failures [23:30:12] (03CR) 10BryanDavis: [C: 032] Restrict local uploads on mai.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207273 (https://phabricator.wikimedia.org/T97397) (owner: 10Dereckson) [23:30:19] (03Merged) 10jenkins-bot: Restrict local uploads on mai.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207273 (https://phabricator.wikimedia.org/T97397) (owner: 10Dereckson) [23:30:23] (03PS3) 10BryanDavis: Import sources on ne.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207170 (https://phabricator.wikimedia.org/T97396) (owner: 10Dereckson) [23:30:32] (03CR) 10BryanDavis: [C: 032] Import sources on ne.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207170 (https://phabricator.wikimedia.org/T97396) (owner: 10Dereckson) [23:30:38] (03Merged) 10jenkins-bot: Import sources on ne.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207170 (https://phabricator.wikimedia.org/T97396) (owner: 10Dereckson) [23:30:54] PROBLEM - puppet last run on cp4008 is CRITICAL Puppet has 2 failures [23:31:04] PROBLEM - puppet last run on cp3008 is CRITICAL Puppet has 2 failures [23:31:23] (03CR) 10Tim Landscheidt: "I don't know if the best way to achieve "consistency of code" and a "saner structure" is to disable Puppet and "hack one's way through". 
" [puppet] - 10https://gerrit.wikimedia.org/r/207304 (owner: 10Yuvipanda) [23:31:43] RECOVERY - puppet last run on cp3010 is OK Puppet is currently enabled, last run 3 seconds ago with 0 failures [23:32:28] (03CR) 10Yuvipanda: "It isn't in any way, but I don't know of any other way at the moment that doesn't involve setting up toolsbeta fully." [puppet] - 10https://gerrit.wikimedia.org/r/207304 (owner: 10Yuvipanda) [23:32:46] !log bd808 Synchronized commonsuploads.dblist: Restrict local uploads on mai.wikipedia {{gerrit|207273}} (duration: 00m 32s) [23:32:51] Logged the message, Master [23:33:54] PROBLEM - Host mw2031 is DOWN: PING CRITICAL - Packet loss = 100% [23:34:44] (03CR) 10Yuvipanda: "Note that you were right and this caused puppet failure too (and has been reverted in subsequent patch)" [puppet] - 10https://gerrit.wikimedia.org/r/207304 (owner: 10Yuvipanda) [23:35:06] !log bd808 Synchronized wmf-config/InitialiseSettings.php: Shell bugs {{gerrit|207162}} {{gerrit|206731}} {{gerrit|203783}} {{gerrit|207273}} {{gerrit|207170}} (duration: 01m 12s) [23:35:11] Logged the message, Master [23:35:17] Testing. [23:35:29] !log mw2031.codfw.wmnet syncing very slowly for SWAT [23:35:34] Logged the message, Master [23:35:44] RECOVERY - Host mw2031 is UPING OK - Packet loss = 0%, RTA = 43.34 ms [23:37:06] 207162 tested. For 207273, you propagated commonsuploads.dblist ? [23:37:17] Dereckson: yes. 
I did that first [23:37:24] RECOVERY - puppet last run on cp1066 is OK Puppet is currently enabled, last run 30 seconds ago with 0 failures [23:37:34] RECOVERY - puppet last run on cp4016 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures [23:38:33] RECOVERY - puppet last run on cp3030 is OK Puppet is currently enabled, last run 55 seconds ago with 0 failures [23:38:55] RECOVERY - puppet last run on cp3007 is OK Puppet is currently enabled, last run 57 seconds ago with 0 failures [23:39:14] RECOVERY - puppet last run on cp4017 is OK Puppet is currently enabled, last run 11 seconds ago with 0 failures [23:39:27] There is an issue with 207273, but nothing is broken, so we can keep the change as is and I'll investigate / submit a follow-up. [23:40:03] RECOVERY - puppet last run on cp1065 is OK Puppet is currently enabled, last run 32 seconds ago with 0 failures [23:40:26] Dereckson: ok. update your bug so you don't forget :) [23:40:39] 203783 tested [23:42:23] RECOVERY - puppet last run on cp1059 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:42:24] RECOVERY - puppet last run on cp3013 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:42:35] RECOVERY - puppet last run on cp3015 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:43:23] RECOVERY - puppet last run on cp1047 is OK Puppet is currently enabled, last run 20 seconds ago with 0 failures [23:43:28] 206731 and 207170 look good. [23:43:37] gwicke: you around for your SWAT patch? [23:43:54] Thank you for the deploy. [23:44:02] Dereckson: that's all of them right? 
(except the one that didn't quite work) [23:44:12] (03PS1) 10Yuvipanda: Revert "tools: Have node::compute::general inherit from toollabs" [puppet] - 10https://gerrit.wikimedia.org/r/207313 [23:44:13] RECOVERY - puppet last run on cp1055 is OK Puppet is currently enabled, last run 35 seconds ago with 0 failures [23:44:17] (03CR) 10jenkins-bot: [V: 04-1] Revert "tools: Have node::compute::general inherit from toollabs" [puppet] - 10https://gerrit.wikimedia.org/r/207313 (owner: 10Yuvipanda) [23:44:23] RECOVERY - puppet last run on cp3012 is OK Puppet is currently enabled, last run 37 seconds ago with 0 failures [23:44:24] RECOVERY - puppet last run on cp3040 is OK Puppet is currently enabled, last run 13 seconds ago with 0 failures [23:44:28] (03PS1) 10Yuvipanda: Revert "tools: Move compute node code into tools::compute" [puppet] - 10https://gerrit.wikimedia.org/r/207314 [23:44:47] (03PS2) 10Yuvipanda: Revert "tools: Move compute node code into tools::compute" [puppet] - 10https://gerrit.wikimedia.org/r/207314 [23:44:54] (03CR) 10Yuvipanda: [C: 032 V: 032] Revert "tools: Move compute node code into tools::compute" [puppet] - 10https://gerrit.wikimedia.org/r/207314 (owner: 10Yuvipanda) [23:45:36] (03PS2) 10Yuvipanda: Revert "tools: Have node::compute::general inherit from toollabs" [puppet] - 10https://gerrit.wikimedia.org/r/207313 [23:45:40] bd808|deploy: indeed [23:45:44] (03CR) 10Yuvipanda: [C: 032 V: 032] Revert "tools: Have node::compute::general inherit from toollabs" [puppet] - 10https://gerrit.wikimedia.org/r/207313 (owner: 10Yuvipanda) [23:45:54] RECOVERY - puppet last run on cp4008 is OK Puppet is currently enabled, last run 40 seconds ago with 0 failures [23:46:04] RECOVERY - puppet last run on cp3016 is OK Puppet is currently enabled, last run 49 seconds ago with 0 failures [23:46:22] (03CR) 10Yuvipanda: "I seem to have massively underestimated what I had attempted to do. Reverted them all." 
[puppet] - 10https://gerrit.wikimedia.org/r/207304 (owner: 10Yuvipanda) [23:46:31] (03PS5) 10BryanDavis: Add AffCom user group application contact page on meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204205 (https://phabricator.wikimedia.org/T95789) (owner: 10Alex Monk) [23:46:41] (03CR) 10BryanDavis: [C: 032] Add AffCom user group application contact page on meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204205 (https://phabricator.wikimedia.org/T95789) (owner: 10Alex Monk) [23:47:01] (03Merged) 10jenkins-bot: Add AffCom user group application contact page on meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/204205 (https://phabricator.wikimedia.org/T95789) (owner: 10Alex Monk) [23:47:26] (03CR) 10Yuvipanda: "Am going to go around and test things on toolsbeta now." [puppet] - 10https://gerrit.wikimedia.org/r/207304 (owner: 10Yuvipanda) [23:47:43] RECOVERY - puppet last run on cp3008 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:47:43] RECOVERY - puppet last run on cp3014 is OK Puppet is currently enabled, last run 1 minute ago with 0 failures [23:48:04] PROBLEM - Kafka Broker Messages In on analytics1021 is CRITICAL: kafka.server.BrokerTopicMetrics.AllTopicsMessagesInPerSec.FifteenMinuteRate CRITICAL: 740.161805614 [23:48:27] !log bd808 Synchronized wmf-config/AffComContactPages.php: Add AffCom user group application contact page on meta {{gerrit|204205}} (duration: 00m 25s) [23:48:29] (03PS2) 10Dzahn: add codfw wtp parsoid servers to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/206478 (https://phabricator.wikimedia.org/T90271) [23:48:35] Logged the message, Master [23:49:31] grumble [23:50:28] !log bd808 Synchronized docroot/noc/createTxtFileSymlinks.sh: Add AffCom user group application contact page on meta {{gerrit|204205}} (duration: 00m 21s) [23:50:33] Logged the message, Master [23:50:45] sync-file wmf-config/CommonSettings.php "Add AffCom user group application contact page on meta 
{{gerrit|204205}}" [23:50:55] wrong window :/ [23:51:12] !log bd808 Synchronized wmf-config/CommonSettings.php: Add AffCom user group application contact page on meta {{gerrit|204205}} (duration: 00m 11s) [23:51:18] Logged the message, Master [23:51:20] well it sort of works: https://meta.wikimedia.org/wiki/Special:Contact/affcomusergroup [23:51:37] they didn't apply the CSS I recommended [23:51:44] lots of redlinks [23:52:21] Krenair: is that subject right? [23:52:27] "Contact message"? [23:52:37] the subject is supposed to be hidden by my css [23:52:43] ah [23:52:45] because the server will overwrite it with the group name [23:53:40] the noc symlinks was wrong too. I'll fix that in a followup [23:54:44] (03PS3) 10Dzahn: add codfw wtp parsoid servers to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/206478 (https://phabricator.wikimedia.org/T90271) [23:55:04] (03PS2) 10Dzahn: parsoid: add role::parsoid::prod to codfw nodes [puppet] - 10https://gerrit.wikimedia.org/r/206479 (https://phabricator.wikimedia.org/T90271) [23:55:22] (03PS1) 10BryanDavis: Fix noc symlink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207319 (https://phabricator.wikimedia.org/T95789) [23:55:43] (03CR) 10Dzahn: [C: 032] add codfw wtp parsoid servers to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/206478 (https://phabricator.wikimedia.org/T90271) (owner: 10Dzahn) [23:55:56] Krenair: So the css needs to be fixed in the extension right? [23:56:02] (03CR) 10BryanDavis: [C: 032] Fix noc symlink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207319 (https://phabricator.wikimedia.org/T95789) (owner: 10BryanDavis) [23:56:07] (03Merged) 10jenkins-bot: Fix noc symlink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/207319 (https://phabricator.wikimedia.org/T95789) (owner: 10BryanDavis) [23:56:08] bd808|deploy, what? 
[23:56:21] "they didn't apply the CSS I recommended" [23:56:40] yeah, the users [23:56:49] it's something to be done on-wiki [23:57:04] for our specific contactpage config [23:57:13] can't go in the extension [23:57:15] ah. ok [23:57:29] YuviPanda: merge conflict [23:57:34] !log bd808 Synchronized docroot/noc/conf/AffComContactPages.php.txt: Add AffCom user group application contact page on meta {{gerrit|207319}} (duration: 00m 28s) [23:57:37] mutante: ? [23:57:42] Logged the message, Master [23:57:47] YuviPanda: Yuvipanda: Revert "tools: Have node::compute::general inherit from toollabs" (7e5c331) [23:57:55] ‘merge conflict’? [23:58:02] oh [23:58:03] WARNING: Revision range includes commits from multiple committers! [23:58:04] puppet-merge [23:58:05] that [23:58:13] I just merged. sorry [23:58:27] ok [23:58:40] oh.... shit [23:58:45] bd808|deploy, I missed something [23:58:54] k. can we fix it? [23:59:10] the EmailUserForm hook needs to check we're actually dealing with the right form [23:59:17] in AffComContactPages.php