[00:03:35] 10Operations, 10cloud-services-team (Kanban): Labstore1006/7 profile for meltdown kernel - https://phabricator.wikimedia.org/T185101#4093533 (10madhuvishy) Reporting back here on what I found So I ran some dd based tests to get baseline numbers 1. dd syncronous test with file larger than client RAM write Ra... [00:09:29] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4093537 (10ayounsi) >>! In T189252#4093427, @BBlack wrote: > [...] but @ayounsi can you confirm this whole list is go... [00:10:23] RECOVERY - Outgoing network saturation on labstore1006 is OK: OK: Less than 10.00% above the threshold [93750000.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [00:11:23] RECOVERY - Incoming network saturation on labstore1007 is OK: OK: Less than 10.00% above the threshold [93750000.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [00:28:23] 10Operations, 10Page-Previews, 10RESTBase, 10Traffic, and 2 others: Cached page previews not shown when refreshed - https://phabricator.wikimedia.org/T184534#4093568 (10Jdlrobson) [00:36:22] 10Operations, 10cloud-services-team (Kanban): Labstore1006/7 profile for meltdown kernel - https://phabricator.wikimedia.org/T185101#4093578 (10madhuvishy) I also ran various tests using fio across the 2 kernels over NFSd - https://tools.wmflabs.org/labstore-profiling/ [00:58:50] ebernhardson: ugh, so sorry - forgot that I had a board meeting for an org I'm part of tonight. [01:08:27] marlier: np, timo covered for you [01:09:14] Oh, great. Krinkle is the man. Thank you! [01:26:30] (03PS1) 10Madhuvishy: dns: Point dumps.wikimedia.org to labstore1007 [dns] - 10https://gerrit.wikimedia.org/r/423078 (https://phabricator.wikimedia.org/T188646) [01:27:08] (03PS2) 10Madhuvishy: dns: Point dumps.wikimedia.org to labstore1007 [dns] - 10https://gerrit.wikimedia.org/r/423078 (https://phabricator.wikimedia.org/T188646) [01:27:32] (03CR) 10Madhuvishy: [C: 04-2] "Putting up in prep of migration, don't merge yet." [dns] - 10https://gerrit.wikimedia.org/r/423078 (https://phabricator.wikimedia.org/T188646) (owner: 10Madhuvishy) [01:27:56] (03CR) 10Madhuvishy: [C: 04-2] dns: Point dumps.wikimedia.org to labstore1007 [dns] - 10https://gerrit.wikimedia.org/r/423078 (https://phabricator.wikimedia.org/T188646) (owner: 10Madhuvishy) [01:37:00] (03PS1) 10Dzahn: add IPv6 records for bromine.eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/423079 (https://phabricator.wikimedia.org/T188163) [01:41:58] (03CR) 10Dzahn: [C: 032] "puppet re-enabled on bromine. 
IP added on interface by puppet" [dns] - 10https://gerrit.wikimedia.org/r/423079 (https://phabricator.wikimedia.org/T188163) (owner: 10Dzahn) [01:45:47] (03PS1) 10Dzahn: cache::misc: add codfw backend for webserver_misc_static [puppet] - 10https://gerrit.wikimedia.org/r/423080 (https://phabricator.wikimedia.org/T188163) [01:59:23] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 3 others: Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not get purged despite many re-uploads - https://phabricator.wikimedia.org/T191028#4093616 (10Krinkle) [02:03:27] (03PS1) 10Madhuvishy: dumps: Move nfs exports config to template [puppet] - 10https://gerrit.wikimedia.org/r/423081 (https://phabricator.wikimedia.org/T181431) [02:06:01] (03CR) 10Madhuvishy: [C: 032] dumps: Move nfs exports config to template [puppet] - 10https://gerrit.wikimedia.org/r/423081 (https://phabricator.wikimedia.org/T181431) (owner: 10Madhuvishy) [02:08:33] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 11 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#4093624 (10Tgr) Ugh, I am really sorry, I don't know how I could misread that so badly :( I guess my eye is trained to react to the word "Declined" in P... [02:30:31] PROBLEM - Host dumpsdata1001 is DOWN: PING CRITICAL - Packet loss = 100% [02:31:01] RECOVERY - Host dumpsdata1001 is UP: PING OK - Packet loss = 0%, RTA = 0.22 ms [03:03:11] PROBLEM - kubelet operational latencies on kubernetes1004 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))): 36658.67497507478 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [03:03:21] PROBLEM - kubelet operational latencies on kubernetes1001 is CRITICAL: CRITICAL - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))): 37452.919960474304 = 15000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId= [03:04:11] RECOVERY - kubelet operational latencies on kubernetes1004 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1004.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [03:04:21] RECOVERY - kubelet operational latencies on kubernetes1001 is OK: OK - scalar( sum(rate(kubelet_runtime_operations_latency_microseconds_sum{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))/ sum(rate(kubelet_runtime_operations_latency_microseconds_count{ job=k8s-node, instance=kubernetes1001.eqiad.wmnet}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [03:28:31] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 891.02 seconds [03:54:32] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 165.53 seconds [04:03:01] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 3 others: Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not 
get purged despite many re-uploads - https://phabricator.wikimedia.org/T191028#4091369 (10Krinkle) It looks like there may be two issue going on here. 1. MediaWiki's cano... [05:01:29] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 3 others: Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not get purged despite many re-uploads - https://phabricator.wikimedia.org/T191028#4093691 (10Krinkle) Never mind about the part where I said I can reproduce it locally on pla... [05:17:12] PROBLEM - BGP status on cr1-ulsfo is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect, AS1299/IPv6: Active [05:20:31] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 34 probes of 324 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [05:25:22] RECOVERY - BGP status on cr1-ulsfo is OK: BGP OK - up: 17, down: 1, shutdown: 0 [05:25:31] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 1 probes of 324 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [05:57:43] (03PS4) 10Elukey: Refactor hadoop/hive monitoring profiles to a simpler structure [puppet] - 10https://gerrit.wikimedia.org/r/423000 (https://phabricator.wikimedia.org/T167790) [06:06:25] (03CR) 10Elukey: "All consumers of profile::prometheus::jmx_agent (except the hadoop ones) are no-op: https://puppet-compiler.wmflabs.org/compiler03/10749/" [puppet] - 10https://gerrit.wikimedia.org/r/423000 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [06:21:02] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 3 others: Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not get purged despite many re-uploads - https://phabricator.wikimedia.org/T191028#4091369 (10Gilles) The first part of the thumb URL is insufficient information, because you'... [07:01:46] (03CR) 10Elukey: [C: 032] Refactor hadoop/hive monitoring profiles to a simpler structure [puppet] - 10https://gerrit.wikimedia.org/r/423000 (https://phabricator.wikimedia.org/T167790) (owner: 10Elukey) [07:10:05] (03CR) 10Alexandros Kosiaris: "It also needs a bump to the Gemfile version for wmf_styleguide to 1.0.1 and most importantly a publishing of it at https://rubygems.org/ge" [puppet] - 10https://gerrit.wikimedia.org/r/420143 (owner: 10Dzahn) [07:32:07] https://phabricator.wikimedia.org/T190988 <- do you need more examples? [07:37:54] yannf: seems fine, but possibly releng is the one that will follow up? Are there actionables for ops? (genuinely asking, don't have a lot of context) [07:39:19] !log rolling restart of yarn-hadoop-nodemanagers on all the hadoop worker nodes after https://gerrit.wikimedia.org/r/423000 [07:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:33] elukey, this didn't happen to my files, I've just spotted them as an admin [07:41:01] yannf: sure sure, I was only trying to figure out if there were actionables for ops (like a possible swift issue etc..) or not. Thanks for the report :) [07:41:44] ok, plz tell me if I can do anything [07:43:13] yannf: I'd say that we could wait for releng's response (or specifically the ones mentioned last in the task) and then see. Keep us in the loop if this doesn't progress [07:43:54] sure [07:50:34] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 3 others: Image thumbnails of File:Ariano_Irpino_ZI.jpeg do not get purged despite many re-uploads - https://phabricator.wikimedia.org/T191028#4093755 (10Gilles) I think I've got it. 
MediaWiki just goes by what it finds in Swift, as is... [07:51:14] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 3 others: Thumbor incorrectly normalizes .jpe and .jpeg into .jpg for Swift thumbnail storage - https://phabricator.wikimedia.org/T191028#4093756 (10Gilles) p:05Triage>03Normal a:03Gilles [07:51:35] (03PS9) 10Elukey: profile::zookeeper:server: add the support for prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/422920 (https://phabricator.wikimedia.org/T177460) [08:21:27] (03PS10) 10Elukey: profile::zookeeper:server: add the support for prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/422920 (https://phabricator.wikimedia.org/T177460) [08:35:22] (03CR) 10Elukey: "Looks good, plus tested in labs: https://puppet-compiler.wmflabs.org/compiler02/10750/" [puppet] - 10https://gerrit.wikimedia.org/r/422920 (https://phabricator.wikimedia.org/T177460) (owner: 10Elukey) [08:38:19] !log rolling restart of hadoop-hdfs-datanode on all the hadoop worker nodes after https://gerrit.wikimedia.org/r/423000 [08:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:31] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Some comments inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/421901 (https://phabricator.wikimedia.org/T190213) (owner: 10Elukey) [08:59:38] (03CR) 10Elukey: [C: 032] profile::zookeeper:server: add the support for prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/422920 (https://phabricator.wikimedia.org/T177460) (owner: 10Elukey) [09:15:06] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-cat: New upstream release [debs/contenttranslation/apertium-cat] - 10https://gerrit.wikimedia.org/r/421825 (https://phabricator.wikimedia.org/T189076) (owner: 10KartikMistry) [09:16:18] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-lex-tools: New upstream release [debs/contenttranslation/apertium-lex-tools] - 10https://gerrit.wikimedia.org/r/419356 (https://phabricator.wikimedia.org/T189075) (owner: 10KartikMistry) [09:17:33] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-fra: New upstream release [debs/contenttranslation/apertium-fra] - 10https://gerrit.wikimedia.org/r/421813 (https://phabricator.wikimedia.org/T189076) (owner: 10KartikMistry) [09:17:46] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-separable: Initial Debian packaging [debs/contenttranslation/apertium-separable] - 10https://gerrit.wikimedia.org/r/421808 (https://phabricator.wikimedia.org/T189075) (owner: 10KartikMistry) [09:27:04] (03PS1) 10Elukey: role::druid::analytics::worker: enable prometheus monitoring for zk [puppet] - 10https://gerrit.wikimedia.org/r/423140 (https://phabricator.wikimedia.org/T177460) [09:31:22] !log restart oozie/hive daemons on an1003 for openjdk-8 upgrades [09:31:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:00] (03CR) 10Elukey: [C: 032] "https://puppet-compiler.wmflabs.org/compiler02/10751/" [puppet] - 10https://gerrit.wikimedia.org/r/423140 (https://phabricator.wikimedia.org/T177460) (owner: 10Elukey) [09:44:39] 10Operations, 10OTRS, 10User-notice: Update OTRS to the latest stable version (6.x.x) - https://phabricator.wikimedia.org/T187984#4093821 (10Scoopfinder) p:05Normal>03Triage >> Scoopfinder triaged this task as Normal priority. > Hmm, do you plan to work on this task? (Asking as you prioritized this task.... 
[09:48:18] (03PS1) 10ArielGlenn: pylint and cleanup of runnerutils [dumps] - 10https://gerrit.wikimedia.org/r/423141 [09:50:09] 10Puppet, 10Beta-Cluster-Infrastructure: deployment-etcd-01 puppet errors - https://phabricator.wikimedia.org/T191107#4093831 (10MarcoAurelio) p:05Triage>03Normal [09:50:28] (03PS4) 10Elukey: profile::restbase: add sysctl settings to improve tcp performance [puppet] - 10https://gerrit.wikimedia.org/r/421901 (https://phabricator.wikimedia.org/T190213) [09:54:36] (03PS5) 10Elukey: profile::restbase: add sysctl settings to improve tcp performance [puppet] - 10https://gerrit.wikimedia.org/r/421901 (https://phabricator.wikimedia.org/T190213) [10:03:39] 10Puppet, 10Beta-Cluster-Infrastructure: deployment-eventlog05 puppet errors - https://phabricator.wikimedia.org/T191109#4093870 (10MarcoAurelio) [10:05:29] 10Puppet, 10Beta-Cluster-Infrastructure: deployment-eventlog05 puppet errors - https://phabricator.wikimedia.org/T191109#4093881 (10MarcoAurelio) ``` maurelio@deployment-eventlog05:~$ sudo puppet agent -tv Info: Using configured environment 'production' Info: Retrieving pluginfacts Info: Retrieving plugin Info... [10:08:35] (03PS1) 10ArielGlenn: split out prefetch code to its own module [dumps] - 10https://gerrit.wikimedia.org/r/423143 [10:13:40] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-mira - https://phabricator.wikimedia.org/T191110#4093928 (10MarcoAurelio) [10:17:36] !log roll restart of zookeeper daemons on druid100[123] (Druid analytics cluster) to pick up the new prometheus jmx agent [10:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:26] !log resuming elastic@codfw cluster restarts [10:55:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:57] dcausse: the downtime set yesterday will expire in ~2h, would that be enough to complete the restarts? [11:03:31] volans: should be enough I have only 6 nodes to do and I do them 3 at a time [11:03:50] ack, in case you're running late let me know and I'll extend it ;) [11:03:57] sure thanks! :) [11:40:09] !log elastic@codfw cluster restarts complete (T189239) [11:40:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:15] T189239: Deploy initial version of the extra-analysis plugin - https://phabricator.wikimedia.org/T189239 [11:44:17] !log running forceSearchIndex from terbium to cleanup elastic indices for (testwiki, mediawikiwiki, labswiki, labtestwiki, svwiki) (T189694) [11:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:23] T189694: forceSearchIndex on testwiki, mediawikiwiki, labswiki, labtestwiki, and svwiki. And everything on Beta Cluster - https://phabricator.wikimedia.org/T189694 [12:22:13] PROBLEM - Incoming network saturation on labstore1007 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [106250000.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [12:28:27] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-mira - https://phabricator.wikimedia.org/T191110#4094366 (10MarcoAurelio) ``` maurelio@deployment-mira:/etc/puppet$ cd modules -bash: cd: modules: No such file or directory ``` It makes sense therefore that puppet can't find t... 
[12:30:13] RECOVERY - Incoming network saturation on labstore1007 is OK: OK: Less than 10.00% above the threshold [93750000.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [12:47:11] !log T189076 upload apertium-fra to apt.wikimedia.org/jessie-wikimedia/main [12:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:18] T189076: Update apertium-fra-cat MT pair - https://phabricator.wikimedia.org/T189076 [12:47:28] !log T189075 upload apertium-separable to apt.wikimedia.org/jessie-wikimedia/main [12:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:34] T189075: Package apertium-separable and dependencies - https://phabricator.wikimedia.org/T189075 [12:47:38] !log T189075 upload apertium-lex-tools to apt.wikimedia.org/jessie-wikimedia/main [12:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:51] !log T189076 upload apertium-cat to apt.wikimedia.org/jessie-wikimedia/main [12:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:09] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-fra-cat] - 10https://gerrit.wikimedia.org/r/421859 (https://phabricator.wikimedia.org/T189076) (owner: 10KartikMistry) [12:49:24] (03CR) 10Alexandros Kosiaris: [C: 031] "With the added note that connection reuse is a better approach to solving this +1" [puppet] - 10https://gerrit.wikimedia.org/r/421901 (https://phabricator.wikimedia.org/T190213) (owner: 10Elukey) [12:49:30] (03CR) 10jerkins-bot: [V: 04-1] apertium-fra-cat: New upstream release [debs/contenttranslation/apertium-fra-cat] - 10https://gerrit.wikimedia.org/r/421859 (https://phabricator.wikimedia.org/T189076) (owner: 10KartikMistry) [12:53:43] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 4 others: Thumbor incorrectly normalizes .jpe and .jpeg into .jpg for Swift thumbnail storage - https://phabricator.wikimedia.org/T191028#4094409 (10Gilles) [12:55:49] akosiaris: looking at apertium-fra-cat build failure.. [12:57:16] kart_: didn't you have a day off ? [12:57:41] I thought about pinging you and avoided it exactly because of that [12:57:49] :) [12:58:02] just saw an email, so thought - ok, lets debug this. 
[13:02:40] (03PS1) 10Elukey: cdh::hadoop: improve class documentation [puppet/cdh] - 10https://gerrit.wikimedia.org/r/423154 [13:05:44] kart_: go enjoy your day off :) [13:05:57] unless you are enjoying debugging [13:07:14] (03PS2) 10KartikMistry: apertium-fra-cat: New upstream release [debs/contenttranslation/apertium-fra-cat] - 10https://gerrit.wikimedia.org/r/421859 (https://phabricator.wikimedia.org/T189076) [13:07:38] akosiaris: boring evening, but celebrating 3,00,000 articles published using ContentTranslation :) [13:07:40] (03CR) 10Elukey: [C: 032] cdh::hadoop: improve class documentation [puppet/cdh] - 10https://gerrit.wikimedia.org/r/423154 (owner: 10Elukey) [13:08:54] akosiaris: fixed :) [13:14:31] (03PS1) 10ArielGlenn: break class for handling the list of dump jobs out into its own module [dumps] - 10https://gerrit.wikimedia.org/r/423155 [13:14:52] (03CR) 10jerkins-bot: [V: 04-1] break class for handling the list of dump jobs out into its own module [dumps] - 10https://gerrit.wikimedia.org/r/423155 (owner: 10ArielGlenn) [13:16:18] (03PS1) 10Elukey: cdh::hadoop: add the config support for HDFS Trash [puppet/cdh] - 10https://gerrit.wikimedia.org/r/423156 (https://phabricator.wikimedia.org/T189051) [13:16:35] (03PS2) 10ArielGlenn: break class for handling the list of dump jobs out into its own module [dumps] - 10https://gerrit.wikimedia.org/r/423155 [13:30:54] (03CR) 10Alexandros Kosiaris: [C: 032] apertium-fra-cat: New upstream release [debs/contenttranslation/apertium-fra-cat] - 10https://gerrit.wikimedia.org/r/421859 (https://phabricator.wikimedia.org/T189076) (owner: 10KartikMistry) [13:51:08] (03PS1) 10Elukey: Update the cdh module to its latest sha [puppet] - 10https://gerrit.wikimedia.org/r/423158 [13:52:56] (03CR) 10Elukey: [C: 032] Update the cdh module to its latest sha [puppet] - 10https://gerrit.wikimedia.org/r/423158 (owner: 10Elukey) [14:08:18] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4094465 (10BBlack) [14:16:29] !log T189076 upload apertium-fra-cat to apt.wikimedia.org/jessie-wikimedia/main [14:16:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:37] T189076: Update apertium-fra-cat MT pair - https://phabricator.wikimedia.org/T189076 [14:32:29] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4094488 (10BBlack) `CC` and `CX` are the last two in the Asia list that are truly-unknown cases (where we're not real... [14:32:44] (03PS1) 10BBlack: eqsin: BN, BT, KH, KR, LA, MN, MO, MV, TW [dns] - 10https://gerrit.wikimedia.org/r/423159 (https://phabricator.wikimedia.org/T189252) [14:32:46] (03PS1) 10BBlack: eqsin: default for AS continent + AP fake-country [dns] - 10https://gerrit.wikimedia.org/r/423160 (https://phabricator.wikimedia.org/T189252) [14:39:38] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4094494 (10BBlack) Note as a hint (may be irrelevant in practice!) that the cablemap shows a future cable-landing for... 
[15:02:37] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4094520 (10BBlack) Ok I dug a bit this morning on CX. They basically have one non-satellite broadband provider, and... [15:04:54] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4094523 (10BBlack) [15:06:08] (03PS2) 10BBlack: eqsin: BN, BT, CC, CX, KH, KR, LA, MN, MO, MV, TW [dns] - 10https://gerrit.wikimedia.org/r/423159 (https://phabricator.wikimedia.org/T189252) [15:06:10] (03PS2) 10BBlack: eqsin: default for AS continent + AP fake-country [dns] - 10https://gerrit.wikimedia.org/r/423160 (https://phabricator.wikimedia.org/T189252) [15:12:21] (03PS3) 10BBlack: eqsin: BN, BT, CC, CX, KH, KR, LA, MN, MO, MV, TW [dns] - 10https://gerrit.wikimedia.org/r/423159 (https://phabricator.wikimedia.org/T189252) [15:12:23] (03PS3) 10BBlack: eqsin: default for AS continent + AP fake-country [dns] - 10https://gerrit.wikimedia.org/r/423160 (https://phabricator.wikimedia.org/T189252) [15:35:29] 10Operations, 10User-Elukey, 10User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#4094554 (10elukey) [15:36:02] 10Operations, 10Puppet, 10Beta-Cluster-Infrastructure: Puppet broken on deployment-mira - https://phabricator.wikimedia.org/T191110#4094555 (10Dzahn) @MarcoAurelio This looks like it's about data missing in Hiera. In production we have: hieradata/role/common/deployment_server.yaml:profile::kubernetes::depl... [15:54:11] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [15:54:11] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) timed out before a response was received [15:54:52] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:54:52] PROBLEM - Disk space on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:54:52] PROBLEM - nutcracker process on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:55:01] PROBLEM - mathoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:55:11] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:55:12] PROBLEM - apertium apy on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:55:12] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/references/{title}{/revision} (Get references of a test page) timed out before a response was received [15:55:12] PROBLEM - restbase endpoints health on restbase1013 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [15:55:22] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:55:22] PROBLEM - Check systemd state on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[15:55:22] PROBLEM - graphoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:55:31] PROBLEM - DPKG on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:55:41] PROBLEM - SSH on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:55:42] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for January 1, 2016) timed out before a response was received: /{domain}/v1/page/css/mobile/app/site (Untitled test) timed out before a response was received: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve the selected anniversaries for January 15) timed out befo [15:55:42] eceived: /{domain}/v1/page/random/title (retrieve a random article) timed out before a response was received [15:55:42] PROBLEM - restbase endpoints health on restbase1011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [15:55:52] PROBLEM - dhclient process on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:55:52] PROBLEM - Check the NTP synchronisation status of timesyncd on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:55:57] scb1001 seems in trouble [15:56:01] PROBLEM - eventstreams on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:56:21] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [15:56:21] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [15:56:21] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [15:56:21] PROBLEM - configured eth on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:56:21] PROBLEM - cpjobqueue endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. [15:56:21] PROBLEM - changeprop endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
[15:57:01] cannot ssh to it, let's try mgmt [15:58:01] RECOVERY - Disk space on scb1001 is OK: DISK OK [15:58:02] RECOVERY - apertium apy on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 5996 bytes in 0.003 second response time [15:58:12] RECOVERY - configured eth on scb1001 is OK: OK - interfaces up [15:58:21] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/page/metadata/{title}{/revision} (Get extended metadata of a test page) is CRITICAL: Test Get extended metadata of a test page returned the unexpected status 504 (expecting: 200): /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received [15:58:21] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.003 second response time [15:58:31] RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational [15:58:31] RECOVERY - DPKG on scb1001 is OK: All packages OK [15:58:32] RECOVERY - SSH on scb1001 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u4 (protocol 2.0) [15:58:41] didn't do anything [15:58:42] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) is CRITICAL: Test Retrieve aggregated feed content for April 29, 2016 responds with malformed body (AttributeError: NoneType object has no attribute get) [15:59:01] PROBLEM - eventstreams on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:59:02] RECOVERY - mathoid endpoints health on scb1001 is OK: All endpoints are healthy [15:59:12] PROBLEM - pdfrender on scb1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:59:21] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [15:59:21] PROBLEM - restbase endpoints health on restbase1014 is CRITICAL: /en.wikipedia.org/v1/page/random/{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 504 (expecting: 303): /en.wikipedia.org/v1/feed/onthisday/{type}/{mm}/{dd} (Retrieve selected the events for Jan 01) timed out before a response was received [15:59:31] load average: 2627.89, 4057.61, 2058.28 [15:59:37] mobrovac: ---^ [15:59:41] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [15:59:42] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy [15:59:51] RECOVERY - eventstreams on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.007 second response time [16:00:01] RECOVERY - dhclient process on scb1001 is OK: PROCS OK: 0 processes with command name dhclient [16:00:01] RECOVERY - nutcracker process on scb1001 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker [16:00:02] RECOVERY - pdfrender on scb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.002 second response time [16:00:21] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy [16:00:53] elukey: need help? 
[16:01:11] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [16:01:21] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy [16:01:21] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy [16:01:21] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy [16:01:21] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy [16:01:21] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy [16:01:21] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy [16:01:21] RECOVERY - restbase endpoints health on restbase1014 is OK: All endpoints are healthy [16:01:22] RECOVERY - cpjobqueue endpoints health on scb1001 is OK: All endpoints are healthy [16:01:22] RECOVERY - restbase endpoints health on restbase1013 is OK: All endpoints are healthy [16:01:41] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy [16:01:51] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v2/translate/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium, adapt the links to target language wiki.) timed out before a response was received: /v1/mt/{from}/{to}{/provider} (Machine translate an HTML fragment using Apertium.) timed out before a response was received [16:01:51] RECOVERY - eventstreams on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1066 bytes in 0.027 second response time [16:02:18] volans: currently in a meeting, not sure what happened to scb1001, if you could check it would be great [16:02:21] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy [16:02:21] RECOVERY - changeprop endpoints health on scb1001 is OK: All endpoints are healthy [16:02:23] but so far it seems fine ? [16:02:40] elukey: ok, let me have a look [16:02:51] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy [16:03:31] RECOVERY - graphoid endpoints health on scb1001 is OK: All endpoints are healthy [16:04:01] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [16:04:33] elukey: dmesg is full of OOMs [16:04:52] ahhh oom party [16:05:11] yeah I suspected that from the load :D [16:05:36] seems recovering though [16:05:51] PROBLEM - cxserver endpoints health on scb1001 is CRITICAL: /v1/page/{language}/{title}{/revision} (Fetch enwiki Oxygen page) timed out before a response was received [16:05:53] the OOMs cause are known or should be investigated? [16:06:10] not aware of anything ongoing [16:06:42] RECOVERY - cxserver endpoints health on scb1001 is OK: All endpoints are healthy [16:07:10] ack [16:11:53] elukey: nothing obvious on system logs right before it started, but I'm not very familiat with those hosts so not sure what to look for in the many /srv/log/* dirs ;) [16:25:52] RECOVERY - Check the NTP synchronisation status of timesyncd on scb1001 is OK: OK: synced at Fri 2018-03-30 16:25:50 UTC. 
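For context on the scb1001 incident above: the host stopped answering NRPE and SSH because it was deep in memory pressure, and the checks recovered on their own once the kernel's OOM killer had freed enough memory. A minimal sketch of how that diagnosis can be confirmed from the host, assuming shell access once SSH responds again; the timestamps and grep patterns below are illustrative, not the exact commands that were run:

```
# confirm the load spike quoted above (the 5-minute average peaked around 4057)
uptime

# list OOM-killer events from the kernel ring buffer; -T prints human-readable timestamps
dmesg -T | grep -iE 'out of memory|oom-killer'

# the same information via the journal, restricted to kernel messages around the incident window
journalctl -k --since '2018-03-30 15:50' | grep -i oom
```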
[16:53:18] (03CR) 10Nuria: [C: 031] eventlogging: move alarms from graphite to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/422135 (https://phabricator.wikimedia.org/T114199) (owner: 10Elukey) [16:53:51] (03PS7) 10Elukey: eventlogging: move alarms from graphite to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/422135 (https://phabricator.wikimedia.org/T114199) [16:56:28] (03CR) 10Elukey: [C: 032] eventlogging: move alarms from graphite to prometheus [puppet] - 10https://gerrit.wikimedia.org/r/422135 (https://phabricator.wikimedia.org/T114199) (owner: 10Elukey) [20:01:10] PROBLEM - Incoming network saturation on labstore1006 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [106250000.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [20:09:10] RECOVERY - Incoming network saturation on labstore1006 is OK: OK: Less than 10.00% above the threshold [93750000.0] https://grafana.wikimedia.org/dashboard/db/labs-monitoring [20:20:40] mutante: re https://phabricator.wikimedia.org/T191110#4094555 -- do we have to add or remove data? [20:22:01] Hauskatze: add [20:22:29] "Could not find data item ... in any Hiera data file and no default supplied" [20:22:31] mutante: okay, however I'm a bit lost here... How am I suposed to do that? Via gerrit or directly on mira? [20:23:01] I've been trying to look for docs but no luck [20:23:21] Hauskatze: there are multiple places to add it.. one is the repo, via gerrit, and the other is special Hiera: pages on wikitech/horizon [20:23:54] they would all work but it should be in the same place where other related ones are [20:24:08] not sure yet myself either which one is used here, but we can search [20:24:24] I don't have privs to edit Hiera: pages on wikitech [20:24:30] search for "profile::kubernetes::deployment_server::" [20:24:39] in both repo and the wiki [20:24:39] I think I can do on Horizon, but not sure either [20:24:56] or just "deployment_server" [20:25:28] (03Abandoned) 10Hashar: interface: IPAddr.new() requires an address family [puppet] - 10https://gerrit.wikimedia.org/r/336840 (owner: 10Hashar) [20:25:51] mutante: I've found https://github.com/search?q=org%3Awikimedia+profile%3A%3Akubernetes%3A%3Adeployment_server%3A%3A&type=Commits [20:26:28] Hauskatze: ah, good! so it's using the labs-private repo [20:26:32] (03CR) 10Hashar: "I needed that on ruby2.4 / Mac. I am no more working on puppet-rspec nowadays and work on a machine that has ruby2.3 anyway." [puppet] - 10https://gerrit.wikimedia.org/r/336840 (owner: 10Hashar) [20:26:56] https://github.com/wikimedia/puppet/commit/3a2551ce098b17259b12baea6e683486df8dd28c is operations/puppet [20:28:09] Hauskatze: soo.. on Gerrit, if you search for the repo called "labs/private" do you see that [20:28:16] 10Operations, 10Beta-Cluster-Infrastructure, 10media-storage, 10Patch-For-Review: nscd does not cache localhost causing high CPU usage when localhost is often resolved - https://phabricator.wikimedia.org/T171745#4094801 (10hashar) 05Open>03declined No time to look into it, so lets archive this task. 
[20:28:38] Hauskatze: this one https://gerrit.wikimedia.org/r/#/q/project:labs/private [20:28:46] you should clone that [20:29:29] (03Abandoned) 10Hashar: swift: save nscd CPU by using IP address [puppet] - 10https://gerrit.wikimedia.org/r/358799 (https://phabricator.wikimedia.org/T160990) (owner: 10Hashar) [20:30:21] Hauskatze: then, in that repo go to hieradata/role/common/deployment_server.yaml [20:31:00] and just like there is ::tokens: you can add "::git_owner" which is missing [20:31:07] if we know what the right value is ... [20:31:33] warning: Clone succeeded, but checkout failed. [20:31:33] You can inspect what was checked out with 'git status' [20:31:33] and retry the checkout with 'git checkout -f HEAD' [20:31:44] some files are erroring [20:31:56] error: unable to create file modules/secret/secrets/ssl/*.wikimedia.org.pem: Invalid argument [20:32:03] ? hmm [20:32:04] but we can commit on gerrit too [20:32:15] did you do this in a new and empty directory? [20:32:35] you mean gerrit patch uploader? [20:32:53] mutante: you can create changes directly on gerrit [20:32:59] since a year ago or so [20:33:03] oh, the new features :p [20:33:16] https://gerrit.wikimedia.org/r/#/admin/projects/labs/private <-- "create change" [20:33:18] yea, paladox told me, hehe [20:33:22] ok! [20:33:22] 2 years ago when we upgraded to 2.12 [20:33:27] ;) [20:33:29] * paladox uses it all the time :) [20:33:52] so, let's re-focus :) -- /me looks for hieradata [20:34:08] yes, so in hieradata you have ./role/ [20:34:17] and these roles match puppet class names [20:34:23] that are applied on the instances [20:35:06] the error says it wants profile::kubernetes::deployment_server::git_owner [20:35:47] hieradata/role/common/deployment_server.yaml has :profile::kubernetes::deployment_server::tokens already [20:35:54] https://gerrit.wikimedia.org/r/plugins/gitiles/labs/private/+/refs/heads/master/hieradata/role/common/deployment_server.yaml [20:36:16] so just add ::git_owner? [20:36:19] yea, so if that deployment-prep instance uses role::deployment_server then this gets applied on it [20:36:27] yea [20:36:39] you can verify what role the instance uses [20:37:26] normally when I log-in there the server says the roles, but deployment-mira doesn't seem to use any? [20:38:06] it just says "The last Puppet run was at Thu Mar 29 14:10:23 UTC 2018 (1813 minutes ago)." [20:38:13] and "do not use this server" [20:39:34] i was thinking about the web ui [20:39:48] same error on deployment-tin [20:39:59] the part that it's called deployment- though and that we just have 2 places using this, deployment and CI... 
[20:40:07] means it should be that [20:40:21] ok [20:40:34] there is that "prefix" thing in Horizon [20:40:47] where it probably says "if host name starts with deploy- then deployment role" [20:41:08] on a project-level [20:42:23] (03Draft1) 10MarcoAurelio: hieradata: fix for deployment-tin/mira lack of ::git_owner [labs/private] - 10https://gerrit.wikimedia.org/r/423178 [20:42:26] (03PS2) 10MarcoAurelio: hieradata: fix for deployment-tin/mira lack of ::git_owner [labs/private] - 10https://gerrit.wikimedia.org/r/423178 [20:42:47] maybe I coded it wrongly [20:42:49] https://gerrit.wikimedia.org/r/#/c/423178/ [20:43:32] that would add profile::kubernetes::deployment_server:tokens::git_owner [20:43:41] but you want profile::kubernetes::deployment_server::git_owner [20:44:16] you need to add it on the same hierarchy level as "tokens" [20:44:38] (03PS3) 10MarcoAurelio: hieradata: fix for deployment-tin/mira lack of ::git_owner [labs/private] - 10https://gerrit.wikimedia.org/r/423178 [20:44:54] each indentation in that file means a hierarchy level [20:44:58] hence "Hiera" [20:45:08] what a pain is puppet, hiera and all that stuff [20:46:15] mutante: so I replaced the line with profile::kubernetes::deployment_server::git_owner [20:46:24] not quite yet, this way you are removing the existing "tokens" stuff [20:46:39] yeah, I though it was quite easy [20:46:42] see.. if you look at line 2 in that , that just says "admin" but is indented.. [20:46:42] :| [20:47:02] what this means is that is profile::kubernetes::deployment_server::tokens::admin [20:47:08] the server::token for admin [20:48:19] leave line 1 untouched and instead add line 6 profile::kubernetes::deployment_server::git_owner: something [20:49:02] where something is "trebuchet" or .. i'm not sure [20:49:47] it's a "key: value" [20:50:52] (03PS4) 10MarcoAurelio: hieradata: fix for deployment-tin/mira lack of ::git_owner [labs/private] - 10https://gerrit.wikimedia.org/r/423178 [20:51:03] https://gerrit.wikimedia.org/r/#/c/423178/4/hieradata/role/common/deployment_server.yaml ? [20:51:08] * Hauskatze headesks [20:51:35] shinken says now "PROBLEM - Puppet errors on deployment-mediawiki07 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]" [20:51:47] [ another error ] [20:51:50] (03CR) 10Dzahn: "looks good now, just add the Bug: line to that ticket plz" [labs/private] - 10https://gerrit.wikimedia.org/r/423178 (owner: 10MarcoAurelio) [20:52:26] (03PS5) 10MarcoAurelio: hieradata: fix for deployment-tin/mira lack of ::git_owner [labs/private] - 10https://gerrit.wikimedia.org/r/423178 (https://phabricator.wikimedia.org/T191110) [20:53:22] (03CR) 10Dzahn: [V: 032 C: 032] hieradata: fix for deployment-tin/mira lack of ::git_owner [labs/private] - 10https://gerrit.wikimedia.org/r/423178 (https://phabricator.wikimedia.org/T191110) (owner: 10MarcoAurelio) [20:53:34] Hauskatze: ok, try puppet again [20:53:58] sudo puppet agent -tv mutante ? 
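The fix being assembled in the exchange above boils down to adding the missing Hiera key at the top level of hieradata/role/common/deployment_server.yaml in labs/private, i.e. at the same indentation as the existing tokens hash rather than nested underneath it. A minimal sketch of the intended end state of that file, assuming roughly the structure described above; the nested token contents and the git_owner value are placeholders (the chat itself only guesses "trebuchet"), not real production values:

```
$ cat hieradata/role/common/deployment_server.yaml
profile::kubernetes::deployment_server::tokens:
  admin:
    token: dummytokenvalue    # placeholder; labs/private only carries dummy secrets
profile::kubernetes::deployment_server::git_owner: trebuchet    # placeholder value, per the guess in the chat
```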
[20:54:04] (except if the deployment-prep master isn't synced yet or so) [20:54:08] Hauskatze: yea [20:54:51] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Evaluation Error: Error while evaluating a Function Call, Could not find data item profile::kubernetes::deployment_server::git_owner in any Hiera data file and no default supplied at /etc/puppet/modules/profile/manifests/kubernetes/deployment_server.pp:5:16 on node [20:54:51] deployment-tin.deployment-prep.eqiad.wmflabs [20:54:51] Warning: Not using cache on failed catalog [20:54:51] Error: Could not retrieve catalog; skipping run [20:55:56] well, that's just like before [20:56:03] seems like that master didnt get the change yet [20:56:10] nothing changed, right [20:57:10] maybe it needs scap? [20:57:57] no, i dont think it's scap related but it does have it's own puppet master [20:58:39] that needs to get the change after we merged in labs/private [20:58:53] maybe it's just waiting a few minutes [21:14:12] 10Puppet, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team: Puppet broken on deployment-mira - https://phabricator.wikimedia.org/T191110#4094818 (10MarcoAurelio) Puppet still failing: ``` maurelio@deployment-tin:~$ sudo puppet agent -tv Info: Using configured environment 'future' Info: Retrieving p... [21:22:38] PROBLEM - Host db2073.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:23:07] 10Operations, 10Commons, 10MediaWiki-File-management, 10Multimedia, and 4 others: Thumbor incorrectly normalizes .jpe and .jpeg into .jpg for Swift thumbnail storage - https://phabricator.wikimedia.org/T191028#4094821 (10Krinkle) > You could generate just as much cache splitting inserting junk between px-... [21:23:19] PROBLEM - Host ps1-c6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [21:24:48] PROBLEM - Host ms-be2015.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:24:58] PROBLEM - Host db2043.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:26:00] hmm [21:26:05] is codfw down? [21:26:27] and are ms-be2015.mgmt and db2043.mgmt and db2073.mgmt in codfw? [21:26:39] PROBLEM - Host db2039.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:27:06] paladox: They're 2xxx, so yes, they're in codfw. 
[21:27:12] ah ok [21:27:13] thanks [21:27:39] PROBLEM - Host db2033.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:27:39] PROBLEM - Host db2035.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:27:39] PROBLEM - Host db2037.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:27:39] PROBLEM - Host db2036.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:27:39] PROBLEM - Host db2044.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:27:39] PROBLEM - Host db2038.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:27:39] PROBLEM - Host db2041.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:27:40] PROBLEM - Host db2042.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:27:40] PROBLEM - Host db2040.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:27:41] PROBLEM - Host db2047.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:27:41] PROBLEM - Host db2048.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:27:42] PROBLEM - Host db2046.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:27:55] mayday mayday [21:28:06] and in a hollyday [21:28:08] PROBLEM - Host db2083.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:28:09] PROBLEM - Host dbstore2002.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [21:28:13] well, it's just codfw [21:28:25] management interfaces going down in codfw is probably not an emergency [21:29:14] equiad would be different I guess bd808 [21:29:45] it would be a bit more worrying, but generally the management (mgmt) network is not needed to keep the wikis running [21:30:02] ah [21:30:19] if it had wmnet would that mean it was down? [21:30:24] its the secondary network that is used for "lights out management" of the servers [21:30:59] and puppets still broken on beta, but that's another thing [21:31:07] paladox: yes, that would be more worrying. The alerts would have just had the hostnames in that case [21:31:20] thanks, i should remeber that now :) [21:31:41] Hauskatze: so fix it ;) I removed myself from that ticket but it sounded like a hiera value just needed to be updated [21:31:46] Hauskatze you may have to run a git fetch origin [21:31:50] and a git rebase origin [21:31:57] on the puppetmaster [21:32:03] as there may be merge conflicts [21:32:07] puppetmaster02 ? [21:32:13] yep [21:32:17] i think so [21:32:25] bd808: I associated kubernetes -> bd808; apologies for that [21:32:55] paladox: I'll let releng to have a look at that; it's their testing environment after all and I don't want to mess [21:33:00] Hauskatze: no worries. I try not to feel responsible for beta cluster things these days. Too many other fires to worry about [21:33:03] ok [21:33:31] * paladox goes back to working on pg :) [21:34:20] bd808: indeed, and I may say that if releng isn't caring about them, why should I [21:34:50] fact is that I've poked many people and asked them to have a look if they could, with no answer received [21:35:26] tending the garden takes a community. it turns out that "owning" the deployment-prep project has never been clearly made the Release Engineering team's "job" [21:35:51] it is a shared responsibility project among many Foundation staff and volunteers [21:36:57] Wikitech and MediaWiki docs says it's their testing environment. I can help there fixing stuff I know how to, but I can't fight this alone. Too many issues and no docs. 
[21:38:14] * Hauskatze prepares an Scotch with ice [22:15:43] not a mayday, not even worth calling in someone if off hours [22:15:54] but is work my making a high priority task (which i am) on how to fix ;] [22:16:25] robh: oh, so you already making the ticket.. ? i was about to [22:17:00] mutante: i know exactly what has to happen [22:17:03] so i can do it =] [22:17:10] i even know the switch to replace it with =] [22:17:22] ok:) cool! [22:19:46] just checkign his spare sheet to see if we kept any spare netgears [22:19:47] i doubt it [22:19:54] but i think we have more than enough spare ex4200s to use instead [22:25:19] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10media-storage: msw-c6-codfw offline - https://phabricator.wikimedia.org/T191129#4094875 (10RobH) p:05Triage>03High [22:26:39] ACKNOWLEDGEMENT - Host db2033.mgmt is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T191129 msw-c6-codfw offline [22:26:39] ACKNOWLEDGEMENT - Host db2035.mgmt is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T191129 msw-c6-codfw offline [22:26:39] ACKNOWLEDGEMENT - Host db2036.mgmt is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T191129 msw-c6-codfw offline [22:26:39] ACKNOWLEDGEMENT - Host db2037.mgmt is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T191129 msw-c6-codfw offline [22:26:39] ACKNOWLEDGEMENT - Host db2038.mgmt is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T191129 msw-c6-codfw offline [22:26:39] ACKNOWLEDGEMENT - Host db2039.mgmt is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T191129 msw-c6-codfw offline [22:26:39] ACKNOWLEDGEMENT - Host db2040.mgmt is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T191129 msw-c6-codfw offline [22:26:40] ACKNOWLEDGEMENT - Host db2041.mgmt is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T191129 msw-c6-codfw offline [22:26:40] ACKNOWLEDGEMENT - Host db2042.mgmt is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T191129 msw-c6-codfw offline [22:26:41] ACKNOWLEDGEMENT - Host db2043.mgmt is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T191129 msw-c6-codfw offline [22:26:41] ACKNOWLEDGEMENT - Host db2044.mgmt is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T191129 msw-c6-codfw offline [22:26:42] ACKNOWLEDGEMENT - Host db2045.mgmt is DOWN: PING CRITICAL - Packet loss = 100% rhalsell T191129 msw-c6-codfw offline [22:47:32] 10Operations, 10ops-codfw, 10DBA, 10DC-Ops, 10media-storage: msw-c6-codfw offline - https://phabricator.wikimedia.org/T191129#4094901 (10Volans) I've agreed with @RobH on IRC that this is not UBN for now for the #dba part. Although assessing the situation I discovered that the rack distribution is far f... [23:53:24] (03Draft2) 10Zoranzoki21: Enable on ku.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423188 [23:53:45] (03PS3) 10Zoranzoki21: Enable on ku.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/423188 (https://phabricator.wikimedia.org/T190944)