[00:00:04] Krinkle: good question. I totally could, yeah. just an extra hop, but that is probably not a bad deal. however, most of the tools aren't behind the proxy, so it is a question of consistency
[00:00:17] I'm excited about *.tools for web tools though. That'll be very useful for compartmentalising things in the front-end
[00:00:26] giving tools more power
[00:00:29] Krinkle: it also complicates failover procedures though, since if the proxy goes down it'll take tools bits with it
[00:01:27] Krinkle: yeah. it'll also make cookie handling simpler
[00:01:37] Krinkle: tools will have to opt-in to it though.
[00:01:58] we may want to formalise it a bit to avoid unwanted duplicate entry points
[00:02:16] Making one of them the default
[00:02:19] Krinkle: it might be one of the new things available with k8s webservices. I spent this week solidifying our k8s infra, I'm going to start looking into the webservice code next week
[00:02:52] Krinkle: yeah. start with /$toolname as default, and have $toolname.tools redirect, and then slowly allow people to switch and then change default eventually
[00:03:06] if we make *.tools the new default, we could figure a way to make /tools for existing tools and allow people to switch between them with a webservice flag of sorts
[00:03:11] Yeah
[00:03:13] :)
[00:03:16] :D
[00:03:54] k8s will also allow us to support more platforms easily. Like php7 / hhvm, python3, java8, ruby, etc
[00:04:15] YuviPanda: So there's 2 things I'd like to try with nagf at some point. 1) use redis instead of fs as cache. 2) Have 2 instances round-robinned (I think we did that already, to allow graceful restart)
[00:04:28] mostly as exercise so I can apply it to more serious tools.
[00:04:55] in order to use redis though, i need to be able to test it locally
[00:04:57] Krinkle: right. we can do (2) trivially right now
[00:04:58] Or rather I want to :D
[00:05:10] which probably depends on the docker image and stuff
[00:05:13] Krinkle: ah, you're in luck, since I just spent yesterday testing the singlenode kubernetes install
[00:05:33] Krinkle: it's now much easier to set up a local install. let me find the docs - took me less than 5mins (assuming you already have docker locally installed)
[00:05:37] or full on k8s. Though I guess that's not needed to test locally?
[00:05:44] Krinkle: http://kubernetes.io/docs/getting-started-guides/docker/
[00:06:06] Krinkle: so that's 'full on k8s', and that's what I'm using now. isn't too difficult to set up.
[00:06:15] Cool
[00:06:19] Krinkle: I'm considering adding it to mwv.
[00:06:25] Yeah I mean more is better in this regard. Better reflection of what'll be in tools prod.
[00:06:27] bd808: ^ what do you think?
[00:06:33] That way I can test the k8s config locally as well
[00:06:40] and experiment and learn how deployments work for example
[00:06:44] Krinkle: yup.
[00:07:07] Krinkle: there'll be a minor difference though, which is the user account your code runs as. but that's unavoidable, I guess.
[00:07:13] yeah
[00:07:28] and no NFS, no labsdb access, etc - but those are all positive anyway :D
[00:07:37] But if we adhere to keeping things as stateless as possible on tools-k8s there shouldn't be anything one could accidentally depend on
[00:07:44] NFS, yeah
[00:07:48] labsdb, not so sure
[00:08:08] yeah.
[00:08:13] I've been kind of staying away from maintenance on tools that use labsdb since it's a pain to debug or work with
[00:08:31] so one thing I eventually want to have is easy to use mysql tunnels that work for people
[00:08:52] basically resorting to non-GUI editing over ssh on a second copy of the tool. Or trying to hack a way to connect from localhost which I can never get to work properly.
[00:09:12] right. so I want us to at some point spend time in making that work properly
[00:09:16] Also tried mocking my mysql abstraction class with sample data instead. Worked well actually.
[00:09:33] seems like a local environment for k8s could make that work as well.
[00:09:38] yeah
[00:09:45] Krinkle: are you on linux or OS X?
[00:09:49] OSX
[00:10:05] These days Windows is getting closer to Linux than OSX.
[00:10:11] Or so I hear :D
[00:10:18] heh :)
[00:10:37] They added some posix bindings in the latest dev preview that basically make ubuntu software work as-is.
[00:10:39] Krinkle: I'll be curious to hear how setting up the kube singlenode on OS X goes :) Do you have docker set up?
[00:10:57] YuviPanda: Yep, you helped me set that up so I can push to dockerhub for nagf
[00:11:07] which has worked fine so far. I did a couple of updates
[00:11:26] Krinkle: aaah, I forgot. yeah, in that case this document should work.
[00:11:47] https://gist.github.com/Krinkle/f3acdb293bd0b37c838221487d8538a4#note
[00:11:57] Krinkle: we also just upgraded to kubernetes 1.2 which has better 'deployment' object support. easier to do 0 downtime deploys
[00:12:42] Hm.. ok. so you'd like me to do http://kubernetes.io/docs/getting-started-guides/docker/ ?
[00:13:04] Krinkle: yes
[00:13:55] Krinkle: btw, nagf is now running 3 instances round-robinned. I just ran 'kubectl --context=nagf scale --replicas=3 rc/nagf'
[00:15:40] Operations: Remove yana@ from legal-tm-vio@wikimedia.org - https://phabricator.wikimedia.org/T132230#2192207 (Krenair) This is an exim alias under wikimedia.org, so ops has to do it, it's not lists.wikimedia.org (and if it were, this request would have to go to the lists' admins rather than a #Wikimedia-Mail...
[00:17:26] goddamit. my flatmate is back here, fucking putting on Donald Trump
[00:17:31] ten more days
[00:18:03] and then I'm outta here. grr
[00:18:16] YuviPanda: that's stored in a yaml file right? I'm on tools-k8s but can't find it
[00:18:36] Krinkle: I just ran it on the commandline
[00:18:55] YuviPanda: Yeah, but there was this yaml file at some point
[00:19:01] Krinkle: yeah
[00:20:46] YuviPanda: Hm.. so that docker command is gonna make bindings to directories on my main drive. Most of which don't exist afaik.
[00:20:53] Or should I run it as docker inside docker?
[00:21:28] Krinkle: nope, it should work I think
[00:21:38] e.g. '--volume=/:/rootfs:ro \'
[00:22:39] Krinkle: it's mounting / on your host to /rootfs in the container
[00:22:47] Oh, right
[00:22:50] to:from
[00:22:57] Or, well, depends on perspective
[00:23:26] Krinkle: and since you're running it on OS X, you are actually running it inside a VM, so that's mounting the VM's / I guess
[00:24:06] so the 'docker' command I have installed internally routes it to the VM?
[00:24:10] That's not entirely obvious
[00:25:05] Krinkle: am pretty sure that's what docker does on OS X
[00:25:14] Hm.. well, the command doesn't work plain.
[00:25:26] I need to be inside the environment which.. is indeed a shell to a virtualbox vm
[00:25:33] :)
[00:25:47] nice :)
[00:26:03] this is one of the primary reasons I switched back to Linux
[00:26:56] Krinkle: use 1.2.1 for K8S_VERSION
[00:27:12] * Krinkle Was about to type v1.2.2
[00:27:36] :) I don't think that's out yet (could be though)
[00:27:43] I haven't checked in a few days
[00:27:53] It told me to look at https://github.com/kubernetes/kubernetes/releases
[00:27:54] doesn't matter, should work on any 1.2 series
[00:27:58] ok :)
[00:28:13] This is probably the biggest command I'm running copy-paste from the internet
[00:28:18] That never goes wrong
[00:28:22] And never gets old
[00:28:22] hehe
[00:29:00] Alrighty fetching :)
[00:29:14] I'll let you know how it works out later. Gonna run now
[00:29:15] o/
[00:30:05] Krinkle: cya
[00:34:03] Krinkle: https://gist.github.com/Krinkle/f3acdb293bd0b37c838221487d8538a4#note I added the YAML file there
[00:34:24] PROBLEM - puppet last run on mw2034 is CRITICAL: CRITICAL: puppet fail
[01:02:53] RECOVERY - puppet last run on mw2034 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures
[02:06:31] (PS1) Yurik: Match JsonConfig change [mediawiki-config] - https://gerrit.wikimedia.org/r/282447
[02:07:32] (PS2) Yurik: Match JsonConfig change [mediawiki-config] - https://gerrit.wikimedia.org/r/282447
[02:23:18] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.20) (duration: 10m 11s)
[02:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:31:58] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Apr 9 02:31:58 UTC 2016 (duration 8m 40s)
[02:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:36:36] Operations, Mail: Remove yana@ from legal-tm-vio@wikimedia.org - https://phabricator.wikimedia.org/T132230#2192235 (Peachey88)
[03:05:23] PROBLEM - Kafka Broker Replica Max Lag on kafka1012 is CRITICAL: CRITICAL: 65.52% of data above the critical threshold [5000000.0]
[03:43:04] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [5000000.0]
[03:51:42] RECOVERY - Kafka Broker Replica Max Lag on kafka1012 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[04:36:52] RECOVERY - Kafka Broker Replica Max Lag on kafka1013 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[04:45:23] PROBLEM - Kafka Broker Replica Max Lag on kafka1018 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [5000000.0]
[04:46:14] PROBLEM - Kafka Broker Replica Max Lag on kafka1014 is CRITICAL: CRITICAL: 72.41% of data above the critical threshold [5000000.0]
[05:28:13] RECOVERY - Kafka Broker Replica Max Lag on kafka1018 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[05:32:53] RECOVERY - Kafka Broker Replica Max Lag on kafka1014 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[06:31:32] PROBLEM - puppet last run on mw2129 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:33] PROBLEM - puppet last run on mw1158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:54] PROBLEM - puppet last run on mw1170 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:05] PROBLEM - puppet last run on ms-be1010 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:56:34] RECOVERY - puppet last run on mw1170 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures
[06:56:53] RECOVERY - puppet last run on ms-be1010 is OK: OK: Puppet is currently enabled, last run 4 seconds ago with 0 failures
[06:58:04] RECOVERY - puppet last run on mw1158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:12] RECOVERY - puppet last run on mw2129 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:35:31] Operations: Commons serving wrong image - https://phabricator.wikimedia.org/T132193#2192338 (Aklapper) Please also make sure that you've [[ https://en.wikipedia.org/wiki/Wikipedia:Bypass_your_cache | bypassed your browser cache ]] before providing a screenshot.
[07:37:26] Operations: Commons serving wrong image - https://phabricator.wikimedia.org/T132193#2192342 (eranroz) Open>Invalid
[07:38:22] Operations: Commons serving wrong image - https://phabricator.wikimedia.org/T132193#2191090 (eranroz) Closing as invalid - seems to be not reproducible (possibly I forgot to purge the browser cache)
[07:59:08] Operations: Linking a bn.wikipedia.org button to G+ page. - https://phabricator.wikimedia.org/T109810#2192353 (Aklapper) @JAlexander: Any chance to provide a status update here?
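Editor's note on the 00:21:38 to 00:22:57 exchange about `--volume=/:/rootfs:ro`: the flag reads as host path first, container path second, with an optional mode. The snippet below is only a minimal sketch of that reading, not Docker's own parser; `parse_volume` and `VolumeSpec` are hypothetical names, and it assumes plain Linux-style paths with no colons in them.

```python
from typing import NamedTuple


class VolumeSpec(NamedTuple):
    """Hypothetical container for the three parts of a bind-mount spec."""
    host_path: str
    container_path: str
    mode: str


def parse_volume(spec: str) -> VolumeSpec:
    """Split a `host:container[:mode]` string into its parts.

    Assumes Linux-style paths that contain no colons.
    """
    parts = spec.split(":")
    if len(parts) == 2:
        host, container = parts
        mode = "rw"  # Docker mounts read-write when no mode is given
    elif len(parts) == 3:
        host, container, mode = parts
    else:
        raise ValueError(f"unsupported volume spec: {spec!r}")
    return VolumeSpec(host, container, mode)


# The value quoted in the log: host "/" appears at "/rootfs", read-only.
print(parse_volume("/:/rootfs:ro"))
```

On OS X at the time of this log, the "host" side was the VirtualBox VM rather than the Mac itself, which matches the 00:23:26 remark.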
[09:10:12] PROBLEM - check_mysql on lutetium is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 821
[09:20:12] RECOVERY - check_mysql on lutetium is OK: Uptime: 1969430 Threads: 1 Questions: 20170473 Slow queries: 11339 Opens: 118246 Flush tables: 2 Open tables: 64 Queries per second avg: 10.241 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[09:34:23] PROBLEM - puppet last run on mw2037 is CRITICAL: CRITICAL: Puppet has 1 failures
[10:01:32] RECOVERY - puppet last run on mw2037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:24:05] (CR) MarcoAurelio: [C: -1] "I merged https://gerrit.wikimedia.org/r/#/c/281320/" [mediawiki-config] - https://gerrit.wikimedia.org/r/280456 (https://phabricator.wikimedia.org/T130514) (owner: Thcipriani)
[11:05:36] !log Disabling tendril on es2005-es2010 (out of prod hosts) to avoid flooding logs of tendril DB - T129452
[11:05:37] T129452: Decommission es2001-es2010 - https://phabricator.wikimedia.org/T129452
[11:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:08:50] (PS1) Nicko: T78342 excluding spec folders in .gitignore file [puppet] - https://gerrit.wikimedia.org/r/282461
[11:11:42] Operations, DBA, Patch-For-Review: Reimage db2047 - check for hardware errors - https://phabricator.wikimedia.org/T132011#2192484 (Volans)
[11:12:17] !log Disabling tendril on db2047 (needs to be reimaged) to avoid flooding logs of tendril DB - T132011
[11:12:18] T132011: Reimage db2047 - check for hardware errors - https://phabricator.wikimedia.org/T132011
[11:12:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:14:37] (CR) Ladsgroup: ores: Scap3 deployment configurations (1 comment) [puppet] - https://gerrit.wikimedia.org/r/280403 (owner: Ladsgroup)
[12:11:34] PROBLEM - Host mw1085 is DOWN: PING CRITICAL - Packet loss = 100%
[12:12:52] RECOVERY - Host mw1085 is UP: PING OK - Packet loss = 0%, RTA = 1.15 ms
[13:01:59] (PS1) Jstenval: Simplification of Cassandra Logstash filtering [puppet] - https://gerrit.wikimedia.org/r/282466 (https://phabricator.wikimedia.org/T130861)
[13:07:26] Operations, Continuous-Integration-Infrastructure, Patch-For-Review: Create a basic RSpec unit test for operations/puppet - https://phabricator.wikimedia.org/T78342#842870 (Nicko) Sorry for being so dumb, but I had a hard time finding how to execute `bundle exec spec`, it was actually because the comm...
[13:11:13] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[13:11:25] (PS2) Jstenval: Simplification of Cassandra Logstash filtering [puppet] - https://gerrit.wikimedia.org/r/282466 (https://phabricator.wikimedia.org/T130861)
[13:11:49] (CR) Edomaur: [C: 1] T78342 excluding spec folders in .gitignore file [puppet] - https://gerrit.wikimedia.org/r/282461 (owner: Nicko)
[13:12:12] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[13:14:51] (CR) Edomaur: [C: 1] Simplification of Cassandra Logstash filtering [puppet] - https://gerrit.wikimedia.org/r/282466 (https://phabricator.wikimedia.org/T130861) (owner: Jstenval)
[13:17:44] PROBLEM - puppet last run on db2009 is CRITICAL: CRITICAL: puppet fail
[13:18:32] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:19:23] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[13:24:09] (PS1) Dereckson: Set wgSemiprotectedRestrictionLevels for fr.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/282469 (https://phabricator.wikimedia.org/T132248)
[13:24:51] (CR) Luke081515: [C: 1] Set wgSemiprotectedRestrictionLevels for fr.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/282469 (https://phabricator.wikimedia.org/T132248) (owner: Dereckson)
[13:26:34] (PS1) Dereckson: Set wgSemiprotectedRestrictionLevels for de.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/282471 (https://phabricator.wikimedia.org/T132249)
[13:27:11] (PS1) Adedommelin: Improve robustness of es-tool [puppet] - https://gerrit.wikimedia.org/r/282472 (https://phabricator.wikimedia.org/T128786)
[13:27:22] (CR) Edomaur: [C: -1] "Ok, no. spec/ is not only the directory with the test results." [puppet] - https://gerrit.wikimedia.org/r/282461 (owner: Nicko)
[13:29:46] Dereckson: Can you tell me what the effect of $wgSemiprotectedRestrictionLevels would be? I'm currently unsure
[13:30:46] Luke081515: it will consider editor protection as semi protection, not full protection
[13:31:08] For example, when you edit the page, instead of .mw-textarea-protected it will add .mw-textarea-sprotected class.
[13:31:09] So I guess we have to change systemmessages again. What's the advantage of that?
[13:31:29] (so the same design as semi protected, not the same design as sysop protected)
[13:31:40] No need to change system message for this change.
[13:32:28] There are two changes in core
[13:32:32] (1) the CSS
[13:32:49] Dereckson: So the effect is for example, that mw shows the message of semiprotection instead of full protection when you edit a page? Then we have to change one, because currently our full protection sysmessages contains a switch for editeditorprotected
[13:33:13] (2) A different notice when moving a page that is move-protected at this level: semiprotectedpagemovewarning instead of protectedpagemovewarning
[13:33:16] I guess the best is, if I will wait till that is deployed to fr, then I can check out the situation ;)
[13:33:22] okay
[13:34:09] It's already active on en. by the way.
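Editor's note on the 13:30:46 to 13:33:13 explanation: a page counts as semi-protected when every restriction on it is one of the levels listed in the setting, and that decides which CSS class and which move warning are used. The snippet below is a toy model of that rule as described in the chat, not MediaWiki's actual code; the function names are hypothetical, and the level names `autoconfirmed`, `sysop` and `editeditorprotected` are used here purely as illustrative values.

```python
from typing import Iterable, Optional


def is_semi_protected(restrictions: Iterable[str],
                      semiprotected_levels: Iterable[str]) -> bool:
    """True when the page is protected and every level is a 'semi' level."""
    levels = set(restrictions)
    return bool(levels) and not (levels - set(semiprotected_levels))


def textarea_class(restrictions: Iterable[str],
                   semiprotected_levels: Iterable[str]) -> Optional[str]:
    """CSS class for the edit box, or None for an unprotected page."""
    restrictions = list(restrictions)
    if not restrictions:
        return None
    if is_semi_protected(restrictions, semiprotected_levels):
        return "mw-textarea-sprotected"
    return "mw-textarea-protected"


def move_warning(restrictions: Iterable[str],
                 semiprotected_levels: Iterable[str]) -> str:
    """Message key shown when moving a move-protected page."""
    if is_semi_protected(restrictions, semiprotected_levels):
        return "semiprotectedpagemovewarning"
    return "protectedpagemovewarning"


# Before the config change: editor protection is styled like full protection.
print(textarea_class(["editeditorprotected"], ["autoconfirmed"]))
# After adding the level to the setting: styled like semi-protection.
print(textarea_class(["editeditorprotected"],
                     ["autoconfirmed", "editeditorprotected"]))
```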
[13:34:40] ok
[13:35:31] (CR) Nicko: [C: 1] Improve robustness of es-tool [puppet] - https://gerrit.wikimedia.org/r/282472 (https://phabricator.wikimedia.org/T128786) (owner: Adedommelin)
[13:44:53] RECOVERY - puppet last run on db2009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:05:03] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 73.33% of data above the critical threshold [5000000.0]
[14:24:29] (PS1) Paladox: Fix viewing raw php files in diffusion [puppet] - https://gerrit.wikimedia.org/r/282478
[14:48:23] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[16:13:11] (Abandoned) Nicko: T78342 excluding spec folders in .gitignore file [puppet] - https://gerrit.wikimedia.org/r/282461 (owner: Nicko)
[16:23:50] (PS1) Nicko: Modification of Rakefile [puppet] - https://gerrit.wikimedia.org/r/282484 (https://phabricator.wikimedia.org/T78342)
[16:30:36] (PS2) Nicko: Modification of Rakefile [puppet] - https://gerrit.wikimedia.org/r/282484 (https://phabricator.wikimedia.org/T78342)
[16:35:46] (CR) JanZerebecki: [C: 1] update the DNS record for benefactors.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/280637 (https://phabricator.wikimedia.org/T130937) (owner: Mschon)
[16:35:59] (CR) JanZerebecki: update the DNS record for benefactors.wikimedia.org (1 comment) [dns] - https://gerrit.wikimedia.org/r/280637 (https://phabricator.wikimedia.org/T130937) (owner: Mschon)
[16:37:01] (CR) Gehel: Modification of Rakefile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/282484 (https://phabricator.wikimedia.org/T78342) (owner: Nicko)
[16:37:10] (CR) JanZerebecki: [C: -1] "Let's go with -all. Unless 20after4 has any concerns." [dns] - https://gerrit.wikimedia.org/r/280644 (https://phabricator.wikimedia.org/T116806) (owner: Mschon)
[16:47:52] (Abandoned) Alex Monk: Add blank mx secrets [labs/private] - https://gerrit.wikimedia.org/r/245139 (https://phabricator.wikimedia.org/T87848) (owner: Alex Monk)
[16:48:39] Puppet, Beta-Cluster-Infrastructure, Patch-For-Review: Puppet failures on deployment-mx: can't find puppet://private/dkim/wikimedia.org-wiki-mail.key - https://phabricator.wikimedia.org/T87848#2192780 (Krenair) Open>Invalid puppet is no longer failing on that host
[16:58:23] We're looking at T131760 (Add icinga monitoring for varnish statistics daemons). Is it just about adding a check that the service is running?
[16:58:24] T131760: Add icinga monitoring for varnish statistics daemons - https://phabricator.wikimedia.org/T131760
[16:59:30] ema: any chance that you'd be around?
[17:03:54] _joe_: ^ since you seem to be connected, any chance you'd have an idea about T131760? Feel free to remind me that it is Saturday and that you are actually spending quality time with the family...
[17:03:54] T131760: Add icinga monitoring for varnish statistics daemons - https://phabricator.wikimedia.org/T131760
[17:09:25] I think most people here use bouncers which stay connected 24/7
[17:11:53] <_joe_> gehel: I spent quality time with the family, and now I'm exhausted
[17:11:55] <_joe_> :P
[17:11:56] Yep, I know, but I've seen messages from joe a few minutes ago. I doubt he is using a bot to post random comments on his behalf :)
[17:12:33] _joe_: relax and enjoy the weekend! I'll ask clarification on the task...
[17:12:42] <_joe_> gehel: I think basically all those daemons run as either cronjobs or systemd units
[17:12:53] <_joe_> and we have no monitoring of said processes
[17:15:29] _joe_: thanks! And sorry to disturb you on a Saturday evening. I'll have a look into it with pioupioo
[17:22:04] PROBLEM - YARN NodeManager Node-State on analytics1039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[17:23:23] <_joe_> gehel: np I mostly promised to be around today
[17:23:37] <_joe_> but well, I was summoned by the family in the end
[17:23:53] _joe_: you do remember that today ends at midnight :P
[17:25:33] RECOVERY - YARN NodeManager Node-State on analytics1039 is OK: OK: YARN NodeManager analytics1039.eqiad.wmnet:8041 Node-State: RUNNING
[17:28:01] ---^ [2098948.579894] CPU14: Package temperature above threshold, cpu clock throttled (total events = 378274610)
[17:28:05] ah snap
[17:34:48] Operations, Analytics-Kanban: Analytics1039 host showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2192798 (elukey)
[17:43:18] Operations, Analytics-Kanban: Analytics1039 host showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2192813 (elukey) https://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=Analytics+cluster+eqiad&m=cpu_report&s=by+name&mc=2&g=cpu_report {F3854845}
[17:46:53] it is one of the hosts in the Hadoop cluster, nothing super heavy but we saw another host rebooting some days ago (without showing anything in dmesg or mcelog though)
[17:59:34] (CR) Gehel: "LGTM, but I have not actually tested it. We should probably merge this at the same time as the cleanup of commons-compiler and janino JARs" [puppet] - https://gerrit.wikimedia.org/r/282466 (https://phabricator.wikimedia.org/T130861) (owner: Jstenval)
[18:01:49] Operations, Analytics-Kanban: Analytics1039 host showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2192798 (Southparkfan) Cmjohnson noted there have been multiple machines lately with CPU temperature issues, where re-applying thermal paste has been an effective fix. I think it's a...
[18:02:42] PROBLEM - puppet last run on ms-be2015 is CRITICAL: CRITICAL: Puppet has 1 failures
[18:03:42] PROBLEM - Kafka Broker Replica Max Lag on kafka1022 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [5000000.0]
[18:11:08] (CR) Gehel: [C: 2] CirrusSearch on Labs uses the full Elasticsearch cluster again [mediawiki-config] - https://gerrit.wikimedia.org/r/282359 (https://phabricator.wikimedia.org/T130219) (owner: Gehel)
[18:12:44] RECOVERY - Kafka Broker Replica Max Lag on kafka1022 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[18:23:42] (PS1) Adedommelin: Add icinga monitoring for varnish statistics daemons [puppet] - https://gerrit.wikimedia.org/r/282488 (https://phabricator.wikimedia.org/T131760)
[18:24:46] (CR) jenkins-bot: [V: -1] Add icinga monitoring for varnish statistics daemons [puppet] - https://gerrit.wikimedia.org/r/282488 (https://phabricator.wikimedia.org/T131760) (owner: Adedommelin)
[18:29:33] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/).
[18:29:33] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/).
[18:30:03] RECOVERY - puppet last run on ms-be2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[18:49:29] (PS2) Adedommelin: Add icinga monitoring for varnish statistics daemons [puppet] - https://gerrit.wikimedia.org/r/282488 (https://phabricator.wikimedia.org/T131760)
[18:51:53] Puppet, Beta-Cluster-Infrastructure, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2192864 (Krenair)
[18:52:16] Puppet, Beta-Cluster-Infrastructure, Tracking: Deployment-prep hosts with puppet errors (tracking) - https://phabricator.wikimedia.org/T132259#2192876 (Krenair)
[18:55:31] Puppet, Beta-Cluster-Infrastructure: Puppet failure on deployment-cache-parsoid05 due to removal of role::cache::parsoid - https://phabricator.wikimedia.org/T132260#2192879 (Krenair)
[18:58:38] (CR) 20after4: "I think -all makes sense. There shouldn't be any email sent from other IPs." [dns] - https://gerrit.wikimedia.org/r/280644 (https://phabricator.wikimedia.org/T116806) (owner: Mschon)
[19:04:55] Puppet, Beta-Cluster-Infrastructure: deployment-memc0[2-4] puppet failure due to missing hiera data mediawiki::redis_servers::codfw - https://phabricator.wikimedia.org/T132262#2192909 (Krenair)
[19:07:11] Puppet, Beta-Cluster-Infrastructure: deployment-redis0[12], deployment-sentry2 puppet failures due to missing hiera data redis::shards - https://phabricator.wikimedia.org/T132263#2192922 (Krenair)
[19:07:21] Puppet, Beta-Cluster-Infrastructure: deployment-memc0[2-4] puppet failure due to missing hiera data mediawiki::redis_servers::codfw - https://phabricator.wikimedia.org/T132262#2192909 (Krenair)
[19:07:37] Puppet, Beta-Cluster-Infrastructure: deployment-memc0[2-4] puppet failure due to missing hiera data mediawiki::redis_servers::codfw - https://phabricator.wikimedia.org/T132262#2192909 (Krenair)
[19:08:00] Puppet, Beta-Cluster-Infrastructure: deployment-redis0[12], deployment-sentry2 puppet failures due to missing hiera data redis::shards - https://phabricator.wikimedia.org/T132263#2192922 (Krenair)
[19:25:52] Puppet, Beta-Cluster-Infrastructure, Services: deployment-(mathoid|sca0[12]|cxserver03) puppet failures "Error: Could not set home on user" due to user being in use by process - https://phabricator.wikimedia.org/T132265#2192956 (Krenair)
[19:51:51] (PS3) Nicko: Modification of Rakefile [puppet] - https://gerrit.wikimedia.org/r/282484 (https://phabricator.wikimedia.org/T78342)
[19:53:32] (CR) Nicko: Modification of Rakefile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/282484 (https://phabricator.wikimedia.org/T78342) (owner: Nicko)
[19:56:38] Puppet, Beta-Cluster-Infrastructure, Scap3: deployment-((sca|aqs)01|ores-web) fails due to scap3 errors - https://phabricator.wikimedia.org/T132267#2192991 (Krenair)
[20:12:31] gehel: don't forget about tin when merging mediawiki-config ^^(a bit above)
[20:17:33] Puppet, Beta-Cluster-Infrastructure: deployment-puppetmaster puppet failures due to apache trying to start on same port as nginx - https://phabricator.wikimedia.org/T132269#2193022 (Krenair)
[20:24:15] Puppet, Beta-Cluster-Infrastructure: deployment-puppetmaster puppet failures due to apache trying to start on same port as nginx - https://phabricator.wikimedia.org/T132269#2193052 (Krenair) a:20after4 I believe that nginx loads because of role::aptly https://wikitech.wikimedia.org/w/index.php?title=No...
[20:30:04] Puppet, Beta-Cluster-Infrastructure: deployment-puppetmaster puppet failures due to apache trying to start on same port as nginx - https://phabricator.wikimedia.org/T132269#2193058 (Krenair) a:20after4>mmodell
[20:42:28] (PS1) Dereckson: Fix wgCopyUploadsDomains on Commons Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/282495
[20:43:53] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[20:44:55] !log krenair@tin Synchronized wmf-config/LabsServices.php: sync labs-only merge of gehel's that was just left unmerged on tin (duration: 00m 41s)
[20:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[20:45:42] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[20:51:47] (PS2) Dereckson: Fix wgCopyUploadsDomains on Commons Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/282495
[20:56:33] volans, Krenair: thanks a lot! The change impacts only labs, which is auto-deploy, I completely forgot to sync it on prod. Thanks!
[21:16:55] Operations, Traffic, WMF-Communications, HTTPS, Security-Other: Server certificate is classified as invalid on government computers - https://phabricator.wikimedia.org/T128182#2193110 (Reedy)
[21:17:04] Operations, ops-eqiad, Analytics-Kanban: Analytics1039 host showed high temperature alarms - https://phabricator.wikimedia.org/T132256#2193111 (Peachey88)
[21:24:43] PROBLEM - Kafka Broker Replica Max Lag on kafka1020 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [5000000.0]
[21:27:15] (PS3) Yurik: Match JsonConfig change [mediawiki-config] - https://gerrit.wikimedia.org/r/282447
[21:38:49] (PS3) Reedy: Fix wgCopyUploadsDomains on Commons Beta [mediawiki-config] - https://gerrit.wikimedia.org/r/282495 (owner: Dereckson)
[22:12:55] (PS1) Alex Monk: Remove old beta bits check [puppet] - https://gerrit.wikimedia.org/r/282498
[22:15:23] RECOVERY - Kafka Broker Replica Max Lag on kafka1020 is OK: OK: Less than 50.00% above the threshold [1000000.0]
[22:51:39] PROBLEM - MariaDB Slave Lag: s1 on db1053 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 374.28 seconds
[22:53:28] on it ^^
[22:57:10] RECOVERY - MariaDB Slave Lag: s1 on db1053 is OK: OK slave_sql_lag Replication lag: 14.48 seconds
[23:17:13] PROBLEM - Kafka Broker Replica Max Lag on kafka1013 is CRITICAL: CRITICAL: 68.97% of data above the critical threshold [5000000.0]
[23:37:03] Puppet, Beta-Cluster-Infrastructure: deployment-memc0[2-4] puppet failure due to missing hiera data mediawiki::redis_servers::codfw - https://phabricator.wikimedia.org/T132262#2193205 (Krenair) {rOPUPa64cf8d6ff72d4326818f58410ae17e14bf498e4} {rOPUP47c7c2cdfa850394176ac9af53c7b51bbdf9bf80} {rOPUPf70cf107fa...
[23:37:39] Puppet, Beta-Cluster-Infrastructure: deployment-memc0[2-4] puppet failure due to missing hiera data mediawiki::redis_servers::codfw - https://phabricator.wikimedia.org/T132262#2193206 (Krenair) a:Joe
[23:38:00] (CR) Alex Monk: "T132262" [puppet] - https://gerrit.wikimedia.org/r/273197 (https://phabricator.wikimedia.org/T127845) (owner: Giuseppe Lavagetto)
[23:41:06] Puppet, Beta-Cluster-Infrastructure: deployment-redis0[12], deployment-sentry2 puppet failures due to missing hiera data redis::shards - https://phabricator.wikimedia.org/T132263#2193207 (Krenair) a:Joe {rOPUP3321bda2e501867e238f9152b68fb952801ace8b}