[00:03:35] 10Operations, 10cloud-services-team (Kanban): Labstore1006/7 profile for meltdown kernel - https://phabricator.wikimedia.org/T185101#4093533 (10madhuvishy) Reporting back here on what I found So I ran some dd based tests to get baseline numbers 1. dd syncronous test with file larger than client RAM write Ra... [00:09:29] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4093537 (10ayounsi) >>! In T189252#4093427, @BBlack wrote: > [...] but @ayounsi can you confirm this whole list is go... [00:10:23] RECOVERY - Outgoing network saturation on labstore1006 is OK: OK: Less than 10.00% above the threshold [93750000.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [00:11:23] RECOVERY - Incoming network saturation on labstore1007 is OK: OK: Less than 10.00% above the threshold [93750000.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [00:28:23] 10Operations, 10Page-Previews, 10RESTBase, 10Traffic, and 2 others: Cached page previews not shown when refreshed - https://phabricator.wikimedia.org/T184534#4093568 (10Jdlrobson) [00:36:22] 10Operations, 10cloud-services-team (Kanban): Labstore1006/7 profile for meltdown kernel - https://phabricator.wikimedia.org/T185101#4093578 (10madhuvishy) I also ran various tests using fio across the 2 kernels over NFSd - https://tools.wmflabs.org/labstore-profiling/ [00:58:50] ebernhardson: ugh, so sorry - forgot that I had a board meeting for an org I'm part of tonight. [01:08:27] marlier: np, timo covered for you [01:09:14] Oh, great. Krinkle is the man. Thank you! [01:26:30] (03PS1) 10Madhuvishy: dns: Point dumps.wikimedia.org to labstore1007 [dns] - 10https://gerrit.wikimedia.org/r/423078 (https://phabricator.wikimedia.org/T188646) [01:27:08] (03PS2) 10Madhuvishy: dns: Point dumps.wikimedia.org to labstore1007 [dns] - 10https://gerrit.wikimedia.org/r/423078 (https://phabricator.wikimedia.org/T188646) [01:27:32] (03CR) 10Madhuvishy: [C: 04-2] "Putting up in prep of migration, don't merge yet." [dns] - 10https://gerrit.wikimedia.org/r/423078 (https://phabricator.wikimedia.org/T188646) (owner: 10Madhuvishy) [01:27:56] (03CR) 10Madhuvishy: [C: 04-2] dns: Point dumps.wikimedia.org to labstore1007 [dns] - 10https://gerrit.wikimedia.org/r/423078 (https://phabricator.wikimedia.org/T188646) (owner: 10Madhuvishy) [01:37:00] (03PS1) 10Dzahn: add IPv6 records for bromine.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/423079 (https://phabricator.wikimedia.org/T188163) [01:41:58] (03CR) 10Dzahn: [C: 032] "puppet re-enabled on bromine. 
IP added on interface by puppet" [dns] - 10https://gerrit.wikimedia.org/r/423079 (https://phabricator.wikimedia.org/T188163) (owner: 10Dzahn) [01:45:47] (03PS1) 10Dzahn: cache::misc: add codfw backend for webserver_misc_static [puppet] - 10https://gerrit.wikimedia.org/r/423080 (https://phabricator.wikimedia.org/T188163) [01:59:23] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 3 others: Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not get purged despite many re-uploads - https://phabricator.wikimedia.org/T191028#4093616 (10Krinkle) [02:03:27] (03PS1) 10Madhuvishy: dumps: Move nfs exports config to template [puppet] - 10https://gerrit.wikimedia.org/r/423081 (https://phabricator.wikimedia.org/T181431) [02:06:01] (03CR) 10Madhuvishy: [C: 032] dumps: Move nfs exports config to template [puppet] - 10https://gerrit.wikimedia.org/r/423081 (https://phabricator.wikimedia.org/T181431) (owner: 10Madhuvishy) [02:08:33] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 11 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#4093624 (10Tgr) Ugh, I am really sorry, I don't know how I could misread that so badly :( I guess my eye is trained to react to the word "Declined" in P... [02:30:31] PROBLEM - Host dumpsdata1001 is DOWN: PING CRITICAL - Packet loss = 100% [02:31:01] RECOVERY - Host dumpsdata1001 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [03:03:11] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))): 36658.67497507478 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [03:03:21] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))): 37452.919960474304 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId= [03:04:11] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [03:04:21] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [03:28:31] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 891.02 seconds [03:54:32] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 165.53 seconds [04:03:01] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 3 others: Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not 
get purged despite many re-uploads - https://phabricator.wikimedia.org/T191028#4091369 (10Krinkle) It looks like there may be two issue going on here. 1. MediaWiki's cano... [05:01:29] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 3 others: Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not get purged despite many re-uploads - https://phabricator.wikimedia.org/T191028#4093691 (10Krinkle) Never mind about the part where I said I can reproduce it locally on pla... [05:17:12] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect, AS1299/IPv6: Active [05:20:31] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 34 probes of 324 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [05:25:22] RECOVERY - BGP status on cr1-ulsfo is OK: BGP OK - up: 17, down: 1, shutdown: 0 [05:25:31] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 1 probes of 324 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [05:57:43] (03PS4) 10Elukey: Refactor hadoop/hive monitoring profiles to a simpler structure [puppet] - 10https://gerrit.wikimedia.org/r/423000 (https://phabricator.wikimedia.org/T167790) [06:06:25] (03CR) 10Elukey: "All consumers of profile::prometheus::jmx_agent (except the hadoop ones) are no-op: https://puppet-compiler.wmflabs.org/compiler03/10749/" [puppet] - 10https://gerrit.wikimedia.org/r/423000 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [06:21:02] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 3 others: Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not get purged despite many re-uploads - https://phabricator.wikimedia.org/T191028#4091369 (10Gilles) The first part of the thumb URL is insufficient information, because you'... [07:01:46] (03CR) 10Elukey: [C: 032] Refactor hadoop/hive monitoring profiles to a simpler structure [puppet] - 10https://gerrit.wikimedia.org/r/423000 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [07:10:05] (03CR) 10Alexandros Kosiaris: "It also needs a bump to the Gemfile version for wmf_styleguide to 1.0.1 and most importantly a publishing of it at https://rubygems.org/ge" [puppet] - 10https://gerrit.wikimedia.org/r/420143 (owner: 10Dzahn) [07:32:07] https://phabricator.wikimedia.org/T190988 <- do you need more examples? [07:37:54] yannf: seems fine, but possibly releng is the one that will follow up? Are there actionables for ops? (genuinely asking, don't have a lot of context) [07:39:19] !log rolling restart of yarn-hadoop-nodemanagers on all the hadoop worker nodes after https://gerrit.wikimedia.org/r/423000 [07:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:33] elukey, this didn't happen to my files, I've just spotted them as an admin [07:41:01] yannf: sure sure, I was only trying to figure out if there were actionables for ops (like a possible swift issue etc..) or not. Thanks for the report :) [07:41:44] ok, plz tell me if I can do anything [07:43:13] yannf: I'd say that we could wait for releng's response (or specifically the ones mentioned last in the task) and then see. Keep us in the loop if this doesn't progress [07:43:54] sure [07:50:34] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 3 others: Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not get purged despite many re-uploads - https://phabricator.wikimedia.org/T191028#4093755 (10Gilles) I think I've got it. 
MediaWiki just goes by what it finds in Swift, as is... [07:51:14] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 3 others: Thumbor incorrectly normalizes .jpe and .jpeg into .jpg for Swift thumbnail storage - https://phabricator.wikimedia.org/T191028#4093756 (10Gilles) p:05Triage>03Normal a:03Gilles [07:51:35] (03PS9) 10Elukey: profile::zookeeper:server: add the support for prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/422920 (https://phabricator.wikimedia.org/T177460) [08:21:27] (03PS10) 10Elukey: profile::zookeeper:server: add the support for prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/422920 (https://phabricator.wikimedia.org/T177460) [08:35:22] (03CR) 10Elukey: "Looks good, plus tested in labs: https://puppet-compiler.wmflabs.org/compiler02/10750/" [puppet] - 10https://gerrit.wikimedia.org/r/422920 (https://phabricator.wikimedia.org/T177460) (owner: 10Elukey) [08:38:19] !log rolling restart of hadoop-hdfs-datanode on all the hadoop worker nodes after https://gerrit.wikimedia.org/r/423000 [08:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:31] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Some comments inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/421901 (https://phabricator.wikimedia.org/T190213) (owner: 10Elukey) [08:59:38] (03CR) 10Elukey: [C: 032] profile::zookeeper:server: add the support for prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/422920 (https://phabricator.wikimedia.org/T177460) (owner: 10Elukey) [09:15:06] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-cat: New upstream release [debs/contenttranslation/apertium-cat] - 10https://gerrit.wikimedia.org/r/421825 (https://phabricator.wikimedia.org/T189076) (owner: 10KartikMistry) [09:16:18] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-lex-tools: New upstream release [debs/contenttranslation/apertium-lex-tools] - 10https://gerrit.wikimedia.org/r/419356 (https://phabricator.wikimedia.org/T189075) (owner: 10KartikMistry) [09:17:33] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-fra: New upstream release [debs/contenttranslation/apertium-fra] - 10https://gerrit.wikimedia.org/r/421813 (https://phabricator.wikimedia.org/T189076) (owner: 10KartikMistry) [09:17:46] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-separable: Initial Debian packaging [debs/contenttranslation/apertium-separable] - 10https://gerrit.wikimedia.org/r/421808 (https://phabricator.wikimedia.org/T189075) (owner: 10KartikMistry) [09:27:04] (03PS1) 10Elukey: role::druid::analytics::worker: enable prometheus monitoring for zk [puppet] - 10https://gerrit.wikimedia.org/r/423140 (https://phabricator.wikimedia.org/T177460) [09:31:22] !log restart oozie/hive daemons on an1003 for openjdk-8 upgrades [09:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:00] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10751/" [puppet] - 10https://gerrit.wikimedia.org/r/423140 (https://phabricator.wikimedia.org/T177460) (owner: 10Elukey) [09:44:39] 10Operations, 10OTRS, 10User-notice: Update OTRS to the latest stable version (6.x.x) - https://phabricator.wikimedia.org/T187984#4093821 (10Scoopfinder) p:05Normal>03Triage >> Scoopfinder triaged this task as Normal priority. > Hmm, do you plan to work on this task? (Asking as you prioritized this task.... 
[09:48:18] (03PS1) 10ArielGlenn: pylint and cleanup of runnerutils [dumps] - 10https://gerrit.wikimedia.org/r/423141 [09:50:09] 10Puppet, 10Beta-Cluster-Infrastructure: deployment-etcd-01 puppet errors - https://phabricator.wikimedia.org/T191107#4093831 (10MarcoAurelio) p:05Triage>03Normal [09:50:28] (03PS4) 10Elukey: profile::restbase: add sysctl settings to improve tcp performance [puppet] - 10https://gerrit.wikimedia.org/r/421901 (https://phabricator.wikimedia.org/T190213) [09:54:36] (03PS5) 10Elukey: profile::restbase: add sysctl settings to improve tcp performance [puppet] - 10https://gerrit.wikimedia.org/r/421901 (https://phabricator.wikimedia.org/T190213) [10:03:39] 10Puppet, 10Beta-Cluster-Infrastructure: deployment-eventlog05 puppet errors - https://phabricator.wikimedia.org/T191109#4093870 (10MarcoAurelio) [10:05:29] 10Puppet, 10Beta-Cluster-Infrastructure: deployment-eventlog05 puppet errors - https://phabricator.wikimedia.org/T191109#4093881 (10MarcoAurelio) ``` maurelio@deployment-eventlog05:~$ sudo puppet agent -tv Info: Using configured environment 'production' Info: Retrieving pluginfacts Info: Retrieving plugin Info... [10:08:35] (03PS1) 10ArielGlenn: split out prefetch code to its own module [dumps] - 10https://gerrit.wikimedia.org/r/423143 [10:13:40] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-mira - https://phabricator.wikimedia.org/T191110#4093928 (10MarcoAurelio) [10:17:36] !log roll restart of zookeeper daemons on druid100[123] (Druid analytics cluster) to pick up the new prometheus jmx agent [10:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:26] !log resuming elastic@codfw cluster restarts [10:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:57] dcausse: the downtime set yesterday will expire in ~2h, would that be enough to complete the restarts? [11:03:31] volans: should be enough I have only 6 nodes to do and I do them 3 at a time [11:03:50] ack, in case you're running late let me know and I'll extend it ;) [11:03:57] sure thanks! :) [11:40:09] !log elastic@codfw cluster restarts complete (T189239) [11:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:15] T189239: Deploy initial version of the extra-analysis plugin - https://phabricator.wikimedia.org/T189239 [11:44:17] !log running forceSearchIndex from terbium to cleanup elastic indices for (testwiki, mediawikiwiki, labswiki, labtestwiki, svwiki) (T189694) [11:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:23] T189694: forceSearchIndex on testwiki, mediawikiwiki, labswiki, labtestwiki, and svwiki. And everything on Beta Cluster - https://phabricator.wikimedia.org/T189694 [12:22:13] PROBLEM - Incoming network saturation on labstore1007 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [106250000.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [12:28:27] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-mira - https://phabricator.wikimedia.org/T191110#4094366 (10MarcoAurelio) ``` maurelio@deployment-mira:/etc/puppet$ cd modules -bash: cd: modules: No such file or directory ``` It makes sense therefore that puppet can't find t... 
[12:30:13] RECOVERY - Incoming network saturation on labstore1007 is OK: OK: Less than 10.00% above the threshold [93750000.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [12:47:11] !log T189076 upload apertium-fra to apt.wikimedia.org/jessie-wikimedia/main [12:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:18] T189076: Update apertium-fra-cat MT pair - https://phabricator.wikimedia.org/T189076 [12:47:28] !log T189075 upload apertium-separable to apt.wikimedia.org/jessie-wikimedia/main [12:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:34] T189075: Package apertium-separable and dependencies - https://phabricator.wikimedia.org/T189075 [12:47:38] !log T189075 upload apertium-lex-tools to apt.wikimedia.org/jessie-wikimedia/main [12:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:51] !log T189076 upload apertium-cat to apt.wikimedia.org/jessie-wikimedia/main [12:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:09] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-fra-cat] - 10https://gerrit.wikimedia.org/r/421859 (https://phabricator.wikimedia.org/T189076) (owner: 10KartikMistry) [12:49:24] (03CR) 10Alexandros Kosiaris: [C: 031] "With the added note that connection reuse is a better approach to solving this +1" [puppet] - 10https://gerrit.wikimedia.org/r/421901 (https://phabricator.wikimedia.org/T190213) (owner: 10Elukey) [12:49:30] (03CR) 10jerkins-bot: [V: 04-1] apertium-fra-cat: New upstream release [debs/contenttranslation/apertium-fra-cat] - 10https://gerrit.wikimedia.org/r/421859 (https://phabricator.wikimedia.org/T189076) (owner: 10KartikMistry) [12:53:43] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 4 others: Thumbor incorrectly normalizes .jpe and .jpeg into .jpg for Swift thumbnail storage - https://phabricator.wikimedia.org/T191028#4094409 (10Gilles) [12:55:49] akosiaris: looking at apertium-fra-cat build failure.. [12:57:16] kart_: didn't you have a day off ? [12:57:41] I thought about pinging you and avoided it exactly because of that [12:57:49] :) [12:58:02] just saw an email, so thought - ok, lets debug this. 
[13:02:40] (03PS1) 10Elukey: cdh::hadoop: improve class documentation [puppet/cdh] - 10https://gerrit.wikimedia.org/r/423154 [13:05:44] kart_: go enjoy your day off :) [13:05:57] unless you are enjoying debugging [13:07:14] (03PS2) 10KartikMistry: apertium-fra-cat: New upstream release [debs/contenttranslation/apertium-fra-cat] - 10https://gerrit.wikimedia.org/r/421859 (https://phabricator.wikimedia.org/T189076) [13:07:38] akosiaris: boring evening, but celebrating 3,00,000 articles published using ContentTranslation :) [13:07:40] (03CR) 10Elukey: [C: 032] cdh::hadoop: improve class documentation [puppet/cdh] - 10https://gerrit.wikimedia.org/r/423154 (owner: 10Elukey) [13:08:54] akosiaris: fixed :) [13:14:31] (03PS1) 10ArielGlenn: break class for handling the list of dump jobs out into its own module [dumps] - 10https://gerrit.wikimedia.org/r/423155 [13:14:52] (03CR) 10jerkins-bot: [V: 04-1] break class for handling the list of dump jobs out into its own module [dumps] - 10https://gerrit.wikimedia.org/r/423155 (owner: 10ArielGlenn) [13:16:18] (03PS1) 10Elukey: cdh::hadoop: add the config support for HDFS Trash [puppet/cdh] - 10https://gerrit.wikimedia.org/r/423156 (https://phabricator.wikimedia.org/T189051) [13:16:35] (03PS2) 10ArielGlenn: break class for handling the list of dump jobs out into its own module [dumps] - 10https://gerrit.wikimedia.org/r/423155 [13:30:54] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-fra-cat: New upstream release [debs/contenttranslation/apertium-fra-cat] - 10https://gerrit.wikimedia.org/r/421859 (https://phabricator.wikimedia.org/T189076) (owner: 10KartikMistry) [13:51:08] (03PS1) 10Elukey: Update the cdh module to its latest sha [puppet] - 10https://gerrit.wikimedia.org/r/423158 [13:52:56] (03CR) 10Elukey: [C: 032] Update the cdh module to its latest sha [puppet] - 10https://gerrit.wikimedia.org/r/423158 (owner: 10Elukey) [14:08:18] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4094465 (10BBlack) [14:16:29] !log T189076 upload apertium-fra-cat to apt.wikimedia.org/jessie-wikimedia/main [14:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:37] T189076: Update apertium-fra-cat MT pair - https://phabricator.wikimedia.org/T189076 [14:32:29] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4094488 (10BBlack) `CC` and `CX` are the last two in the Asia list that are truly-unknown cases (where we're not real... [14:32:44] (03PS1) 10BBlack: eqsin: BN, BT, KH, KR, LA, MN, MO, MV, TW [dns] - 10https://gerrit.wikimedia.org/r/423159 (https://phabricator.wikimedia.org/T189252) [14:32:46] (03PS1) 10BBlack: eqsin: default for AS continent + AP fake-country [dns] - 10https://gerrit.wikimedia.org/r/423160 (https://phabricator.wikimedia.org/T189252) [14:39:38] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4094494 (10BBlack) Note as a hint (may be irrelevant in practice!) that the cablemap shows a future cable-landing for... 
[15:02:37] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4094520 (10BBlack) Ok I dug a bit this morning on CX. They basically have one non-satellite broadband provider, and... [15:04:54] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4094523 (10BBlack) [15:06:08] (03PS2) 10BBlack: eqsin: BN, BT, CC, CX, KH, KR, LA, MN, MO, MV, TW [dns] - 10https://gerrit.wikimedia.org/r/423159 (https://phabricator.wikimedia.org/T189252) [15:06:10] (03PS2) 10BBlack: eqsin: default for AS continent + AP fake-country [dns] - 10https://gerrit.wikimedia.org/r/423160 (https://phabricator.wikimedia.org/T189252) [15:12:21] (03PS3) 10BBlack: eqsin: BN, BT, CC, CX, KH, KR, LA, MN, MO, MV, TW [dns] - 10https://gerrit.wikimedia.org/r/423159 (https://phabricator.wikimedia.org/T189252) [15:12:23] (03PS3) 10BBlack: eqsin: default for AS continent + AP fake-country [dns] - 10https://gerrit.wikimedia.org/r/423160 (https://phabricator.wikimedia.org/T189252) [15:35:29] 10Operations, 10User-Elukey, 10User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#4094554 (10elukey) [15:36:02] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-mira - https://phabricator.wikimedia.org/T191110#4094555 (10Dzahn) @MarcoAurelio This looks like it's about data missing in Hiera. In production we have: hieradata/role/common/deployment_server.yaml:profile::kubernetes::depl... [15:54:11] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [15:54:11] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) timed out before a response was received [15:54:52] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:54:52] PROBLEM - Disk space on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:54:52] PROBLEM - nutcracker process on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:55:01] PROBLEM - mathoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:55:11] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:55:12] PROBLEM - apertium apy on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:55:12] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/references/{title}{/revision} (Get references of a test page) timed out before a response was received [15:55:12] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [15:55:22] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:55:22] PROBLEM - Check systemd state on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[15:55:22] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:55:31] PROBLEM - DPKG on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:55:41] PROBLEM - SSH on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:55:42] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/css/mobile/app/site (Untitled test) timed out before a response was received: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve the selected anniversaries for January 15) timed out befo [15:55:42] eceived: /{domain}/v1/page/random/title (retrieve a random article) timed out before a response was received [15:55:42] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [15:55:52] PROBLEM - dhclient process on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:55:52] PROBLEM - Check the NTP synchronisation status of timesyncd on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:55:57] scb1001 seems in trouble [15:56:01] PROBLEM - eventstreams on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:56:21] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [15:56:21] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [15:56:21] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [15:56:21] PROBLEM - configured eth on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:56:21] PROBLEM - cpjobqueue endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:56:21] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[15:57:01] cannot ssh to it, let's try mgmt [15:58:01] RECOVERY - Disk space on scb1001 is OK: DISK OK [15:58:02] RECOVERY - apertium apy on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 0.003 second response time [15:58:12] RECOVERY - configured eth on scb1001 is OK: OK - interfaces up [15:58:21] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) is CRITICAL: Test Get extended metadata of a test page returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [15:58:21] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [15:58:31] RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational [15:58:31] RECOVERY - DPKG on scb1001 is OK: All packages OK [15:58:32] RECOVERY - SSH on scb1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [15:58:41] didn't do anything [15:58:42] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body (AttributeError: NoneType object has no attribute get) [15:59:01] PROBLEM - eventstreams on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:59:02] RECOVERY - mathoid endpoints health on scb1001 is OK: All endpoints are healthy [15:59:12] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:59:21] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [15:59:21] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303): /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [15:59:31] load average: 2627.89, 4057.61, 2058.28 [15:59:37] mobrovac: ---^ [15:59:41] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [15:59:42] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [15:59:51] RECOVERY - eventstreams on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.007 second response time [16:00:01] RECOVERY - dhclient process on scb1001 is OK: PROCS OK: 0 processes with command name dhclient [16:00:01] RECOVERY - nutcracker process on scb1001 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [16:00:02] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [16:00:21] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [16:00:53] elukey: need help? 
[16:01:11] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [16:01:21] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [16:01:21] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [16:01:21] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [16:01:21] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [16:01:21] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [16:01:21] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [16:01:21] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [16:01:22] RECOVERY - cpjobqueue endpoints health on scb1001 is OK: All endpoints are healthy [16:01:22] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [16:01:41] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [16:01:51] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received [16:01:51] RECOVERY - eventstreams on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.027 second response time [16:02:18] volans: currently in a meeting, not sure what happened to scb1001, if you could check it would be great [16:02:21] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [16:02:21] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [16:02:23] but so far it seems fine ? [16:02:40] elukey: ok, let me have a look [16:02:51] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy [16:03:31] RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy [16:04:01] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [16:04:33] elukey: dmesg is full of OOMs [16:04:52] ahhh oom party [16:05:11] yeah I suspected that from the load :D [16:05:36] seems recovering though [16:05:51] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received [16:05:53] the OOMs cause are known or should be investigated? [16:06:10] not aware of anything ongoing [16:06:42] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy [16:07:10] ack [16:11:53] elukey: nothing obvious on system logs right before it started, but I'm not very familiat with those hosts so not sure what to look for in the many /srv/log/* dirs ;) [16:25:52] RECOVERY - Check the NTP synchronisation status of timesyncd on scb1001 is OK: OK: synced at Fri 2018-03-30 16:25:50 UTC. 
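For context on the scb1001 incident above: the host stopped answering NRPE and SSH because it was deep in memory pressure, and the checks recovered on their own once the kernel's OOM killer had freed enough memory. A minimal sketch of how that diagnosis can be confirmed from the host, assuming shell access once SSH responds again; the timestamps and grep patterns below are illustrative, not the exact commands that were run:

```
# confirm the load spike quoted above (the 5-minute average peaked around 4057)
uptime

# list OOM-killer events from the kernel ring buffer; -T prints human-readable timestamps
dmesg -T | grep -iE 'out of memory|oom-killer'

# the same information via the journal, restricted to kernel messages around the incident window
journalctl -k --since '2018-03-30 15:50' | grep -i oom
```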
[16:53:18] (03CR) 10Nuria: [C: 031] eventlogging: move alarms from graphite to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/422135 (https://phabricator.wikimedia.org/T114199) (owner: 10Elukey) [16:53:51] (03PS7) 10Elukey: eventlogging: move alarms from graphite to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/422135 (https://phabricator.wikimedia.org/T114199) [16:56:28] (03CR) 10Elukey: [C: 032] eventlogging: move alarms from graphite to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/422135 (https://phabricator.wikimedia.org/T114199) (owner: 10Elukey) [20:01:10] PROBLEM - Incoming network saturation on labstore1006 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [106250000.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [20:09:10] RECOVERY - Incoming network saturation on labstore1006 is OK: OK: Less than 10.00% above the threshold [93750000.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [20:20:40] mutante: re https://phabricator.wikimedia.org/T191110#4094555 -- do we have to add or remove data? [20:22:01] Hauskatze: add [20:22:29] "Could not find data item ... in any Hiera data file and no default supplied" [20:22:31] mutante: okay, however I'm a bit lost here... How am I suposed to do that? Via gerrit or directly on mira? [20:23:01] I've been trying to look for docs but no luck [20:23:21] Hauskatze: there are multiple places to add it.. one is the repo, via gerrit, and the other is special Hiera: pages on wikitech/horizon [20:23:54] they would all work but it should be in the same place where other related ones are [20:24:08] not sure yet myself either which one is used here, but we can search [20:24:24] I don't have privs to edit Hiera: pages on wikitech [20:24:30] search for "profile::kubernetes::deployment_server::" [20:24:39] in both repo and the wiki [20:24:39] I think I can do on Horizon, but not sure either [20:24:56] or just "deployment_server" [20:25:28] (03Abandoned) 10Hashar: interface: IPAddr.new() requires an address family [puppet] - 10https://gerrit.wikimedia.org/r/336840 (owner: 10Hashar) [20:25:51] mutante: I've found https://github.com/search?q=org%3Awikimedia+profile%3A%3Akubernetes%3A%3Adeployment_server%3A%3A&type=Commits [20:26:28] Hauskatze: ah, good! so it's using the labs-private repo [20:26:32] (03CR) 10Hashar: "I needed that on ruby2.4 / Mac. I am no more working on puppet-rspec nowadays and work on a machine that has ruby2.3 anyway." [puppet] - 10https://gerrit.wikimedia.org/r/336840 (owner: 10Hashar) [20:26:56] https://github.com/wikimedia/puppet/commit/3a2551ce098b17259b12baea6e683486df8dd28c is operations/puppet [20:28:09] Hauskatze: soo.. on Gerrit, if you search for the repo called "labs/private" do you see that [20:28:16] 10Operations, 10Beta-Cluster-Infrastructure, 10media-storage, 10Patch-For-Review: nscd does not cache localhost causing high CPU usage when localhost is often resolved - https://phabricator.wikimedia.org/T171745#4094801 (10hashar) 05Open>03declined No time to look into it, so lets archive this task. 
[20:28:38] Hauskatze: this one https://gerrit.wikimedia.org/r/#/q/project:labs/private [20:28:46] you should clone that [20:29:29] (03Abandoned) 10Hashar: swift: save nscd CPU by using IP address [puppet] - 10https://gerrit.wikimedia.org/r/358799 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [20:30:21] Hauskatze: then, in that repo go to hieradata/role/common/deployment_server.yaml [20:31:00] and just like there is ::tokens: you can add "::git_owner" which is missing [20:31:07] if we know what the right value is ... [20:31:33] warning: Clone succeeded, but checkout failed. [20:31:33] You can inspect what was checked out with 'git status' [20:31:33] and retry the checkout with 'git checkout -f HEAD' [20:31:44] some files are erroring [20:31:56] error: unable to create file modules/secret/secrets/ssl/*.wikimedia.org.pem: Invalid argument [20:32:03] ? hmm [20:32:04] but we can commit on gerrit too [20:32:15] did you do this in a new and empty directory? [20:32:35] you mean gerrit patch uploader? [20:32:53] mutante: you can create changes directly on gerrit [20:32:59] since a year ago or so [20:33:03] oh, the new features :p [20:33:16] https://gerrit.wikimedia.org/r/#/admin/projects/labs/private <-- "create change" [20:33:18] yea, paladox told me, hehe [20:33:22] ok! [20:33:22] 2 years ago when we upgraded to 2.12 [20:33:27] ;) [20:33:29] * paladox uses it all the time :) [20:33:52] so, let's re-focus :) -- /me looks for hieradata [20:34:08] yes, so in hieradata you have ./role/ [20:34:17] and these roles match puppet class names [20:34:23] that are applied on the instances [20:35:06] the error says it wants profile::kubernetes::deployment_server::git_owner [20:35:47] hieradata/role/common/deployment_server.yaml has :profile::kubernetes::deployment_server::tokens already [20:35:54] https://gerrit.wikimedia.org/r/plugins/gitiles/labs/private/+/refs/heads/master/hieradata/role/common/deployment_server.yaml [20:36:16] so just add ::git_owner? [20:36:19] yea, so if that deployment-prep instance uses role::deployment_server then this gets applied on it [20:36:27] yea [20:36:39] you can verify what role the instance uses [20:37:26] normally when I log-in there the server says the roles, but deployment-mira doesn't seem to use any? [20:38:06] it just says "The last Puppet run was at Thu Mar 29 14:10:23 UTC 2018 (1813 minutes ago)." [20:38:13] and "do not use this server" [20:39:34] i was thinking about the web ui [20:39:48] same error on deployment-tin [20:39:59] the part that it's called deployment- though and that we just have 2 places using this, deployment and CI... 
[20:40:07] means it should be that [20:40:21] ok [20:40:34] there is that "prefix" thing in Horizon [20:40:47] where it probably says "if host name starts with deploy- then deployment role" [20:41:08] on a project-level [20:42:23] (03Draft1) 10MarcoAurelio: hieradata: fix for deployment-tin/mira lack of ::git_owner [labs/private] - 10https://gerrit.wikimedia.org/r/423178 [20:42:26] (03PS2) 10MarcoAurelio: hieradata: fix for deployment-tin/mira lack of ::git_owner [labs/private] - 10https://gerrit.wikimedia.org/r/423178 [20:42:47] maybe I coded it wrongly [20:42:49] https://gerrit.wikimedia.org/r/#/c/423178/ [20:43:32] that would add profile::kubernetes::deployment_server:tokens::git_owner [20:43:41] but you want profile::kubernetes::deployment_server::git_owner [20:44:16] you need to add it on the same hierarchy level as "tokens" [20:44:38] (03PS3) 10MarcoAurelio: hieradata: fix for deployment-tin/mira lack of ::git_owner [labs/private] - 10https://gerrit.wikimedia.org/r/423178 [20:44:54] each indentation in that file means a hierarchy level [20:44:58] hence "Hiera" [20:45:08] what a pain is puppet, hiera and all that stuff [20:46:15] mutante: so I replaced the line with profile::kubernetes::deployment_server::git_owner [20:46:24] not quite yet, this way you are removing the existing "tokens" stuff [20:46:39] yeah, I though it was quite easy [20:46:42] see.. if you look at line 2 in that , that just says "admin" but is indented.. [20:46:42] :| [20:47:02] what this means is that is profile::kubernetes::deployment_server::tokens::admin [20:47:08] the server::token for admin [20:48:19] leave line 1 untouched and instead add line 6 profile::kubernetes::deployment_server::git_owner: something [20:49:02] where something is "trebuchet" or .. i'm not sure [20:49:47] it's a "key: value" [20:50:52] (03PS4) 10MarcoAurelio: hieradata: fix for deployment-tin/mira lack of ::git_owner [labs/private] - 10https://gerrit.wikimedia.org/r/423178 [20:51:03] https://gerrit.wikimedia.org/r/#/c/423178/4/hieradata/role/common/deployment_server.yaml ? [20:51:08] * Hauskatze headesks [20:51:35] shinken says now "PROBLEM - Puppet errors on deployment-mediawiki07 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]" [20:51:47] [ another error ] [20:51:50] (03CR) 10Dzahn: "looks good now, just add the Bug: line to that ticket plz" [labs/private] - 10https://gerrit.wikimedia.org/r/423178 (owner: 10MarcoAurelio) [20:52:26] (03PS5) 10MarcoAurelio: hieradata: fix for deployment-tin/mira lack of ::git_owner [labs/private] - 10https://gerrit.wikimedia.org/r/423178 (https://phabricator.wikimedia.org/T191110) [20:53:22] (03CR) 10Dzahn: [V: 032 C: 032] hieradata: fix for deployment-tin/mira lack of ::git_owner [labs/private] - 10https://gerrit.wikimedia.org/r/423178 (https://phabricator.wikimedia.org/T191110) (owner: 10MarcoAurelio) [20:53:34] Hauskatze: ok, try puppet again [20:53:58] sudo puppet agent -tv mutante ? 
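The fix being assembled in the exchange above boils down to adding the missing Hiera key at the top level of hieradata/role/common/deployment_server.yaml in labs/private, i.e. at the same indentation as the existing tokens hash rather than nested underneath it. A minimal sketch of the intended end state of that file, assuming roughly the structure described above; the nested token contents and the git_owner value are placeholders (the chat itself only guesses "trebuchet"), not real production values:

```
$ cat hieradata/role/common/deployment_server.yaml
profile::kubernetes::deployment_server::tokens:
  admin:
    token: dummytokenvalue    # placeholder; labs/private only carries dummy secrets
profile::kubernetes::deployment_server::git_owner: trebuchet    # placeholder value, per the guess in the chat
```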
[20:54:04] (except if the deployment-prep master isn't synced yet or so) [20:54:08] Hauskatze: yea [20:54:51] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, Could not find data item profile::kubernetes::deployment_server::git_owner in any Hiera data file and no default supplied at /etc/puppet/modules/profile/manifests/kubernetes/deployment_server.pp:5:16 on node [20:54:51] deployment-tin.deployment-prep.eqiad.wmflabs [20:54:51] Warning: Not using cache on failed catalog [20:54:51] Error: Could not retrieve catalog; skipping run [20:55:56] well, that's just like before [20:56:03] seems like that master didnt get the change yet [20:56:10] nothing changed, right [20:57:10] maybe it needs scap? [20:57:57] no, i dont think it's scap related but it does have it's own puppet master [20:58:39] that needs to get the change after we merged in labs/private [20:58:53] maybe it's just waiting a few minutes [21:14:12] 10Puppet, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: Puppet broken on deployment-mira - https://phabricator.wikimedia.org/T191110#4094818 (10MarcoAurelio) Puppet still failing: ``` maurelio@deployment-tin:~$ sudo puppet agent -tv Info: Using configured environment 'future' Info: Retrieving p... [21:22:38] PROBLEM - Host db2073.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:23:07] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 4 others: Thumbor incorrectly normalizes .jpe and .jpeg into .jpg for Swift thumbnail storage - https://phabricator.wikimedia.org/T191028#4094821 (10Krinkle) > You could generate just as much cache splitting inserting junk between px-... [21:23:19] PROBLEM - Host ps1-c6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [21:24:48] PROBLEM - Host ms-be2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:24:58] PROBLEM - Host db2043.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:26:00] hmm [21:26:05] is codfw down? [21:26:27] and are ms-be2015.mgmt and db2043.mgmt and db2073.mgmt in codfw? [21:26:39] PROBLEM - Host db2039.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:27:06] paladox: They're 2xxx, so yes, they're in codfw. 
[21:27:12] ah ok [21:27:13] thanks [21:27:39] PROBLEM - Host db2033.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:27:39] PROBLEM - Host db2035.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:27:39] PROBLEM - Host db2037.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:27:39] PROBLEM - Host db2036.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:27:39] PROBLEM - Host db2044.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:27:39] PROBLEM - Host db2038.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:27:39] PROBLEM - Host db2041.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:27:40] PROBLEM - Host db2042.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:27:40] PROBLEM - Host db2040.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:27:41] PROBLEM - Host db2047.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:27:41] PROBLEM - Host db2048.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:27:42] PROBLEM - Host db2046.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:27:55] mayday mayday [21:28:06] and in a hollyday [21:28:08] PROBLEM - Host db2083.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:28:09] PROBLEM - Host dbstore2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:28:13] well, it's just codfw [21:28:25] management interfaces going down in codfw is probably not an emergency [21:29:14] equiad would be different I guess bd808 [21:29:45] it would be a bit more worrying, but generally the management (mgmt) network is not needed to keep the wikis running [21:30:02] ah [21:30:19] if it had wmnet would that mean it was down? [21:30:24] its the secondary network that is used for "lights out management" of the servers [21:30:59] and puppets still broken on beta, but that's another thing [21:31:07] paladox: yes, that would be more worrying. The alerts would have just had the hostnames in that case [21:31:20] thanks, i should remeber that now :) [21:31:41] Hauskatze: so fix it ;) I removed myself from that ticket but it sounded like a hiera value just needed to be updated [21:31:46] Hauskatze you may have to run a git fetch origin [21:31:50] and a git rebase origin [21:31:57] on the puppetmaster [21:32:03] as there may be merge conflicts [21:32:07] puppetmaster02 ? [21:32:13] yep [21:32:17] i think so [21:32:25] bd808: I associated kubernetes -> bd808; apologies for that [21:32:55] paladox: I'll let releng to have a look at that; it's their testing environment after all and I don't want to mess [21:33:00] Hauskatze: no worries. I try not to feel responsible for beta cluster things these days. Too many other fires to worry about [21:33:03] ok [21:33:31] * paladox goes back to working on pg :) [21:34:20] bd808: indeed, and I may say that if releng isn't caring about them, why should I [21:34:50] fact is that I've poked many people and asked them to have a look if they could, with no answer received [21:35:26] tending the garden takes a community. it turns out that "owning" the deployment-prep project has never been clearly made the Release Engineering team's "job" [21:35:51] it is a shared responsibility project among many Foundation staff and volunteers [21:36:57] Wikitech and MediaWiki docs says it's their testing environment. I can help there fixing stuff I know how to, but I can't fight this alone. Too many issues and no docs. 
[21:38:14] * Hauskatze prepares an Scotch with ice [22:15:43] not a mayday, not even worth calling in someone if off hours [22:15:54] but is work my making a high priority task (which i am) on how to fix ;] [22:16:25] robh: oh, so you already making the ticket.. ? i was about to [22:17:00] mutante: i know exactly what has to happen [22:17:03] so i can do it =] [22:17:10] i even know the switch to replace it with =] [22:17:22] ok:) cool! [22:19:46] just checkign his spare sheet to see if we kept any spare netgears [22:19:47] i doubt it [22:19:54] but i think we have more than enough spare ex4200s to use instead [22:25:19] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10media-storage: msw-c6-codfw offline - https://phabricator.wikimedia.org/T191129#4094875 (10RobH) p:05Triage>03High [22:26:39] ACKNOWLEDGEMENT - Host db2033.mgmt is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T191129 msw-c6-codfw offline [22:26:39] ACKNOWLEDGEMENT - Host db2035.mgmt is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T191129 msw-c6-codfw offline [22:26:39] ACKNOWLEDGEMENT - Host db2036.mgmt is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T191129 msw-c6-codfw offline [22:26:39] ACKNOWLEDGEMENT - Host db2037.mgmt is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T191129 msw-c6-codfw offline [22:26:39] ACKNOWLEDGEMENT - Host db2038.mgmt is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T191129 msw-c6-codfw offline [22:26:39] ACKNOWLEDGEMENT - Host db2039.mgmt is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T191129 msw-c6-codfw offline [22:26:39] ACKNOWLEDGEMENT - Host db2040.mgmt is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T191129 msw-c6-codfw offline [22:26:40] ACKNOWLEDGEMENT - Host db2041.mgmt is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T191129 msw-c6-codfw offline [22:26:40] ACKNOWLEDGEMENT - Host db2042.mgmt is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T191129 msw-c6-codfw offline [22:26:41] ACKNOWLEDGEMENT - Host db2043.mgmt is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T191129 msw-c6-codfw offline [22:26:41] ACKNOWLEDGEMENT - Host db2044.mgmt is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T191129 msw-c6-codfw offline [22:26:42] ACKNOWLEDGEMENT - Host db2045.mgmt is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T191129 msw-c6-codfw offline [22:47:32] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10media-storage: msw-c6-codfw offline - https://phabricator.wikimedia.org/T191129#4094901 (10Volans) I've agreed with @RobH on IRC that this is not UBN for now for the #dba part. Although assessing the situation I discovered that the rack distribution is far f... [23:53:24] (03Draft2) 10Zoranzoki21: Enable on ku.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423188 [23:53:45] (03PS3) 10Zoranzoki21: Enable on ku.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423188 (https://phabricator.wikimedia.org/T190944)