[00:00:50] 06Operations, 10IDS-extension, 10Wikimedia-Extension-setup, 07I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#2897602 (10kaldari) [00:02:30] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [00:14:30] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [00:15:30] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3878708 keys, up 52 days 15 hours - replication_delay is 0 [00:21:30] PROBLEM - puppet last run on kafka1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:40:41] rather impressed, running LambdaMART against our small set with minimal hyperopt tuning, in the first tree it comes up with the same ndcg@10 as our current prod weights (0.77), by 75 trees it has pushed up to 0.80. Not sure how much further to let it run, but by the default cutoff of 1k trees it's able to push the score to 0.835 [00:40:55] (03CR) 10Krinkle: labsdb: cleanup maintain-meta_p enough to make it viable (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/325949 (owner: 10Rush) [00:41:30] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [00:41:47] wrong room [00:42:30] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3876330 keys, up 52 days 16 hours - replication_delay is 48 [00:42:50] PROBLEM - puppet last run on cp3035 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [00:49:30] RECOVERY - puppet last run on kafka1014 is OK: OK: Puppet is currently enabled, last run 26 seconds ago with 0 failures [01:09:14] 06Operations: Lost session data on every save attempt - https://phabricator.wikimedia.org/T153984#2897779 (10Tacsipacsi) 05Open>03Invalid Oops, nope. It works now, thanks! (N.B. Have I added the right tag? I was not sure how to classify this "mysterious error". What should I do if something like this happens... [01:10:50] RECOVERY - puppet last run on cp3035 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [01:23:50] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 641 600 - REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3877846 keys, up 52 days 17 hours - replication_delay is 641 [01:26:50] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3873363 keys, up 52 days 17 hours - replication_delay is 0 [01:34:20] PROBLEM - puppet last run on ganeti1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:01:45] (03PS6) 10Kaldari: Add cron job for PageAssessments maintenance script to puppet [puppet] - 10https://gerrit.wikimedia.org/r/326856 (https://phabricator.wikimedia.org/T153026) [02:02:20] RECOVERY - puppet last run on ganeti1001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [02:22:01] Hey guys [02:22:10] are you having issues with text-lb.esams.wikimedia.org? [02:23:27] We're seeing a small degree of packetloss for certain caches [02:23:31] well - 10% [02:23:40] on an mtr -rw [02:24:03] t0mb0: ping from Illiad network (France) and Voo (Belgium) [02:24:17] could you publish your stacktrace?
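The LambdaMART message at [00:40:41] above reports its scores as ndcg@10. For reference, here is a minimal, self-contained sketch of how NDCG@k is computed — this illustrates the metric only, and is not the code used in the actual training run:

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain: rel_i / log2(i + 1) over 1-indexed positions.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # Normalize by the DCG of the ideal (relevance-descending) ordering,
    # so a perfect ranking scores exactly 1.0.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels in the order the ranker returned the documents:
print(round(ndcg_at_k([3, 2, 3, 0, 1, 2], 10), 3))  # → 0.961
```

The quoted numbers live on this 0-to-1 scale: a single tree matching the hand-tuned prod weights at 0.77, and the full 1k-tree ensemble reaching 0.835.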
[02:24:27] yeah hold on [02:24:49] we are monitoring a lot of regional caches [02:24:52] and it's not isolated to just one [02:24:58] hold on just dumping it into pastebin [02:26:59] If you've an account on Phabricator you can use https://phabricator.wikimedia.org/paste/. If you've `arc` installed, you can even do a `command | arc paste` (well not from mtr) [02:27:44] unfortunately our sec policies won't permit me to show you the entire trace [02:27:54] but I can show you the trace from our provider's core [02:28:02] FWIW there is no loss before that point [02:28:22] The part with the loss is indeed the helpful one. [02:28:24] http://pastebin.ubuntu.com/23671267/ [02:29:31] Strange. [02:29:42] not all our peers are down [02:29:54] for example mr.wikipedia.org is OK [02:30:11] we are caching with squid [02:30:22] so packetloss and latency are sensitive [02:32:18] do you also have packet loss when you try to reach text-lb.eqiad.wikimedia.org? [02:32:48] are you listening on port 80? [02:32:51] this is tcp trace [02:33:15] giving it a whirl [02:33:15] yep [02:33:18] we have loss [02:33:21] I'll try just icmp [02:33:56] ah I can't we are blocking icmp out [02:35:44] And would you have an idea when that started? [02:36:35] 23:01:31 UTC [02:37:15] Thanks. [02:40:08] Dereckson: ok so our caches have all come back except one [02:40:37] 06Operations, 10netops: Packet loss from Voxel to text load balancers - https://phabricator.wikimedia.org/T153998#2897912 (10Dereckson) [02:41:21] OK even that one is back now [02:42:56] and you still have a 10% packet loss or back to 0%? [02:43:05] packet loss is still there [02:46:12] ok interesting running mtr not in report mode shows no loss [02:51:26] Dereckson: did you observe anything on your end? [02:54:02] t0mb0: I've checked Icinga reports, all is green, but I don't have access to all the networking infrastructure. I've filed a task with the information you reported, we'll update it when we identify some cause.
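The loss figures t0mb0 pasted come from `mtr` report-mode output (`mtr -rw`, or with TCP probes something like the `mtr -4 --tcp -P 80` invocation mentioned later in this log). A small sketch of pulling per-hop loss out of such a report; the sample report text and the 1% threshold here are fabricated for illustration:

```python
def lossy_hops(report, threshold=1.0):
    # Parse 'mtr -rw' style report text; return (host, loss_pct) pairs for
    # hops whose packet loss exceeds the threshold percentage. Hop lines
    # look like: "  2.|-- core.provider.example  10.0%  50  9.8 ..."
    hops = []
    for line in report.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0].endswith("|--"):
            host, loss = parts[1], float(parts[2].rstrip("%"))
            if loss > threshold:
                hops.append((host, loss))
    return hops

SAMPLE = """\
Start: 2016-12-23T02:28:00+0000
HOST: cache-proxy                 Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- gw.example.net             0.0%    50    0.4   0.5   0.3   1.9   0.2
  2.|-- core.provider.example     10.0%    50    9.8  10.1   9.5  14.2   0.8
  3.|-- text-lb.esams.wikimedia.org 10.0%  50   10.2  10.4   9.9  15.0   0.9
"""

print(lossy_hops(SAMPLE))
```

Loss that first appears at one hop and persists through to the destination implicates that hop or beyond; loss shown only at a single intermediate hop is often just probe rate limiting, consistent with Dereckson's closing remark about mtr and rate limiting.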
[02:55:08] (03PS1) 10MaxSem: Labs: remove unused wmgNoticeProject [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328863 [02:55:10] (03PS1) 10MaxSem: Labs: remove unused wmgUseWebFonts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328864 [02:55:12] (03PS1) 10MaxSem: Labs: remove wmgUseEcho overrides [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328865 [02:55:14] (03PS1) 10MaxSem: Labs: remove wmgEchoMentionStatusNotifications [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328866 [02:55:16] (03PS1) 10MaxSem: Labs: remove TMH and MwEmbedSupport override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328867 [02:55:18] (03PS1) 10MaxSem: Labs: remove wmgUseCommonsMetadata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328868 [02:55:21] (03PS1) 10MaxSem: Labs: remove unused wmgCommonsMetadataForceRecalculate [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328869 [02:55:22] (03PS1) 10MaxSem: Labs: remove wmgUseGWToolset [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328870 [02:58:01] Dereckson: nps I guess from my perspective everything is ok ow [02:58:02] **now [02:58:18] I'm wondering whether mtr --report mode was causing some delay (bug maybe?) [02:58:41] thanks for your help nonetheless [03:00:57] You're welcome. [03:02:35] If mtr, could be rate limiting. [03:27:30] PROBLEM - puppet last run on etcd1003 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [03:55:30] RECOVERY - puppet last run on etcd1003 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [04:09:10] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=339.90 Read Requests/Sec=2451.80 Write Requests/Sec=5.60 KBytes Read/Sec=33254.80 KBytes_Written/Sec=1212.40 [04:15:33] (03PS1) 10Kaldari: Setting $wgPageAssessmentsOnTalkPages to false for enwikivoyage [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328874 [04:19:10] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=18.40 Read Requests/Sec=0.00 Write Requests/Sec=1.80 KBytes Read/Sec=0.00 KBytes_Written/Sec=41.20 [04:34:57] Hello, I'm Kris. I'm interested in contributing to Wikipedia. I'm a BSCS student. My skills are Java, HTML, CSS, SQL, and basics of Python and Linux. Please help me with understanding a few things. [04:35:16] 1. What is the difference between the Operations team and the Release Engineering team? [04:36:21] 2. I'm interested in contributing; what skills do I require to be a successful contributor? Thanks. [05:00:40] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:29:40] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 57 seconds ago with 0 failures [05:34:46] 06Operations, 10Wikimedia-Language-setup, 10Wikimedia-Site-requests, 05Goal: Wikipedias with zh-* language codes waiting to be renamed (zh-min-nan -> nan, zh-yue -> yue, zh-classical -> lzh) - https://phabricator.wikimedia.org/T10217#2898094 (10Liuxinyu970226) [05:35:20] PROBLEM - puppet last run on analytics1027 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:59:50] PROBLEM - puppet last run on db1043 is CRITICAL: CRITICAL: Catalog fetch fail.
Either compilation failed or puppetmaster has issues [06:04:20] RECOVERY - puppet last run on analytics1027 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures [06:09:50] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [06:27:50] RECOVERY - puppet last run on db1043 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [06:34:30] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Puppet has 28 failures. Last run 2 minutes ago with 28 failures. Failed resources (up to 3 shown): Package[gdb],Exec[eth0_v6_token],Package[wipe],Package[zotero/translators] [06:37:50] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures [07:03:08] 06Operations, 10Gerrit, 07Beta-Cluster-reproducible, 13Patch-For-Review: gerrit jgit gc caused mediawiki/core repo problems - https://phabricator.wikimedia.org/T151676#2898185 (10hashar) [07:03:30] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [07:04:54] 06Operations, 10Gerrit, 07Beta-Cluster-reproducible, 13Patch-For-Review: gerrit jgit gc caused mediawiki/core repo problems - https://phabricator.wikimedia.org/T151676#2824332 (10hashar) T152801 is the same issue for `mediawiki/vagrant.git` which has bunch of deleted `wmf/*` branches still referencing some... [07:06:06] 06Operations, 10Gerrit, 07Beta-Cluster-reproducible, 13Patch-For-Review: gerrit jgit gc caused mediawiki/core repo problems - https://phabricator.wikimedia.org/T151676#2898191 (10hashar) [07:22:50] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [07:33:50] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [07:56:07] !log installing Python security updates [07:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:17:57] 06Operations: Some mw hosts trigger a dpkg conffile prompt when upgrading php-pear - https://phabricator.wikimedia.org/T154007#2898242 (10MoritzMuehlenhoff) [08:35:52] 06Operations: Lost session data on every save attempt - https://phabricator.wikimedia.org/T153984#2898272 (10jcrespo) > Have I added the right tag? @Tacsipacsi When in doubt, give no tags, and it will be triaged for you. You will also attract more attention if it is not tagged (while if it is tagged, people will... [08:41:12] 06Operations: Some mw hosts trigger a dpkg conffile prompt when upgrading php-pear - https://phabricator.wikimedia.org/T154007#2898279 (10MoritzMuehlenhoff) p:05Triage>03Normal [08:55:19] (03CR) 10Muehlenhoff: [C: 031] "Great, thanks for working on that." [puppet] - 10https://gerrit.wikimedia.org/r/324642 (https://phabricator.wikimedia.org/T111934) (owner: 10Filippo Giunchedi) [09:00:20] moritzm: FYI there is https://github.com/twitter/twemproxy/pull/321 that is interesting [09:17:37] oh, nice [09:23:42] !log stopping slave dbstore2001(s2) and db2034 for more table cloning [09:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:31:59] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2898359 (10Esc3300) [10:01:30] PROBLEM - puppet last run on mw1234 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:03:50] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 10.192.48.44 on port 6479 [10:04:50] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS 2.8.17 on 10.192.48.44:6479 has 1 databases (db0) with 3835388 keys, up 53 days 1 hours - replication_delay is 0 [10:11:00] 06Operations, 10MediaWiki-API, 10Parsoid, 10RESTBase, and 6 others: HHVM request timeouts not working; support lowering the API request timeout per request - https://phabricator.wikimedia.org/T97192#2898453 (10Joe) [10:29:28] 06Operations, 06Discovery, 10Performance-Metrics, 10Traffic, and 3 others: compile number of http uses for http://www.wikidata.org/entity - https://phabricator.wikimedia.org/T154017#2898485 (10Esc3300) [10:30:30] RECOVERY - puppet last run on mw1234 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [10:32:16] 06Operations, 10Traffic, 07HTTPS, 07Tracking: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681#2898501 (10Esc3300) [10:32:20] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2898500 (10Esc3300) [10:45:30] PROBLEM - puppet last run on analytics1033 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[zsh-beta] [10:54:08] 06Operations, 10Ops-Access-Requests, 10Gerrit: Root for Mukunda for Gerrit machine(s) - https://phabricator.wikimedia.org/T152236#2842574 (10mark) This is approved. [10:55:50] 06Operations: Lost session data on every save attempt - https://phabricator.wikimedia.org/T153984#2898511 (10Tacsipacsi) Thanks for your explanation, I hope everything will be OK (it currently seems so).
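The flapping `Redis status tcp_6479` checks in this log report a `replication_delay` value with OK/CRITICAL states. Whatever the production check measures exactly, its shape is a threshold comparison over fields from Redis's `INFO replication` output; a rough sketch, where the warn/crit thresholds and the use of `master_last_io_seconds_ago` as the lag signal are assumptions, not the real check's logic:

```python
def check_replication(info_text, warn=60, crit=600):
    # Parse "key:value" lines as produced by the Redis INFO command and
    # classify the replica's lag. Thresholds (seconds) are illustrative.
    info = dict(line.split(":", 1) for line in info_text.splitlines()
                if ":" in line and not line.startswith("#"))
    delay = int(info.get("master_last_io_seconds_ago", 0))
    if delay >= crit:
        return "CRITICAL", delay
    if delay >= warn:
        return "WARNING", delay
    return "OK", delay

SAMPLE_INFO = """\
# Replication
role:slave
master_host:10.192.48.44
master_link_status:up
master_last_io_seconds_ago:641
"""

print(check_replication(SAMPLE_INFO))  # → ('CRITICAL', 641)
```

The short CRITICAL/RECOVERY pairs a minute apart suggest the delay briefly crossed the threshold and then caught up, rather than an outage.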
[11:06:36] (03CR) 10Alexandros Kosiaris: [C: 031] "Looks fine to me, but this needs some careful steps, if we want to minimize gaps/holes in the graphs. That is:" [puppet] - 10https://gerrit.wikimedia.org/r/328599 (https://phabricator.wikimedia.org/T123733) (owner: 10Dzahn) [11:12:30] RECOVERY - puppet last run on analytics1033 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [11:16:40] PROBLEM - puppet last run on ms-fe1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:16:57] (03PS1) 10ArielGlenn: add twentyafterfour to gerrit-root group [puppet] - 10https://gerrit.wikimedia.org/r/328882 (https://phabricator.wikimedia.org/T152236) [11:21:24] 06Operations, 10netops: Packet loss from Voxel to text load balancers - https://phabricator.wikimedia.org/T153998#2897912 (10akosiaris) This traceroute does not make much sense. It reports a ping time of ~100ms for the last step, but more like 10ms for the step right before that. Our infrastructure definitely... [11:21:54] (03CR) 10Paladox: [C: 031] add twentyafterfour to gerrit-root group [puppet] - 10https://gerrit.wikimedia.org/r/328882 (https://phabricator.wikimedia.org/T152236) (owner: 10ArielGlenn) [11:24:26] (03CR) 10ArielGlenn: [C: 032] add twentyafterfour to gerrit-root group [puppet] - 10https://gerrit.wikimedia.org/r/328882 (https://phabricator.wikimedia.org/T152236) (owner: 10ArielGlenn) [11:27:22] 06Operations, 10Ops-Access-Requests, 10Gerrit, 13Patch-For-Review: Root for Mukunda for Gerrit machine(s) - https://phabricator.wikimedia.org/T152236#2842574 (10ArielGlenn) Please verify that you have the rights you need, and I will then close this ticket. [11:34:30] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:39:50] PROBLEM - puppet last run on restbase1016 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:45:12] (03PS20) 10Alexandros Kosiaris: Add profile::kubernetes::node profile class [puppet] - 10https://gerrit.wikimedia.org/r/324212 [11:45:14] (03PS19) 10Alexandros Kosiaris: Include ::profile::kubernetes::node in role::kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/324213 [11:45:16] (03PS1) 10Alexandros Kosiaris: k8s::apiserver: Allow overriding the authorization [puppet] - 10https://gerrit.wikimedia.org/r/328884 [11:45:18] (03PS1) 10Alexandros Kosiaris: profile::kubernetes::master: Specify authz_mode to undef [puppet] - 10https://gerrit.wikimedia.org/r/328885 [11:45:40] RECOVERY - puppet last run on ms-fe1002 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [11:53:49] (03CR) 10Alexandros Kosiaris: "PCC happy at https://puppet-compiler.wmflabs.org/4986/ and is a noop for toollabs as well, merging" [puppet] - 10https://gerrit.wikimedia.org/r/328884 (owner: 10Alexandros Kosiaris) [11:53:52] (03CR) 10Alexandros Kosiaris: [C: 032] k8s::apiserver: Allow overriding the authorization [puppet] - 10https://gerrit.wikimedia.org/r/328884 (owner: 10Alexandros Kosiaris) [11:54:30] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:58:46] (03PS2) 10Alexandros Kosiaris: profile::kubernetes::master: Specify no authz_mode [puppet] - 10https://gerrit.wikimedia.org/r/328885 [11:58:48] (03PS21) 10Alexandros Kosiaris: Add profile::kubernetes::node profile class [puppet] - 10https://gerrit.wikimedia.org/r/324212 [11:58:50] (03PS20) 10Alexandros Kosiaris: Include ::profile::kubernetes::node in role::kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/324213 [12:00:29] (03PS1) 10Urbanecm: Enable mapframe for nowiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328889 (https://phabricator.wikimedia.org/T154021) [12:02:30] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 46 seconds ago with 0 failures [12:02:45] 06Operations, 05Prometheus-metrics-monitoring, 15User-Elukey: Port memcached statistics from ganglia to prometheus - https://phabricator.wikimedia.org/T147326#2898631 (10elukey) [12:04:42] 06Operations, 10netops: Packet loss from Voxel to text load balancers - https://phabricator.wikimedia.org/T153998#2898632 (10Dereckson) The traceroute has been generated with `mtr` sending TCP packets to port 80, ie something like `mtr -4 --tcp -P 80 text-lb.esams.wikimedia.org`. An explanation I can see for... [12:06:59] 06Operations, 10DNS, 06Wikisource: Redirect mul.wikisource.org to wikisource.org - https://phabricator.wikimedia.org/T75407#2898636 (10Liuxinyu970226) [12:07:50] RECOVERY - puppet last run on restbase1016 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [12:08:17] 06Operations, 05Prometheus-metrics-monitoring, 15User-Elukey: Port memcached statistics from ganglia to prometheus - https://phabricator.wikimedia.org/T147326#2898652 (10elukey) I tried to add some metrics in https://grafana.wikimedia.org/dashboard/db/prometheus-memcached-dc-stats (still a draft). The idea i... 
[12:12:42] (03CR) 10Alexandros Kosiaris: [C: 032] "PCC happy at https://puppet-compiler.wmflabs.org/4988/argon.eqiad.wmnet/ merging" [puppet] - 10https://gerrit.wikimedia.org/r/328885 (owner: 10Alexandros Kosiaris) [12:14:40] 06Operations, 10MediaWiki-API, 10Parsoid, 10RESTBase, and 6 others: HHVM request timeouts not working; support lowering the API request timeout per request - https://phabricator.wikimedia.org/T97192#2898823 (10Joe) After some more extensive testing in the last couple of days, I found that the current vanil... [12:18:10] RECOVERY - puppet last run on chlorine is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [12:22:30] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures [12:32:01] 06Operations: reinstall rdb100[56] with RAID - https://phabricator.wikimedia.org/T140442#2898880 (10elukey) >>! In T140442#2897044, @Dzahn wrote: > @elukey i see on rdb1005 for example it has a sda2 entirely used for /tmp, so 20G temp. That reminded me of the 2 videoscalers we reinstalled recently. is this the s... [12:34:23] (03PS6) 10Muehlenhoff: Make systemd-timesyncd available as an alternative time synchronisation provider [puppet] - 10https://gerrit.wikimedia.org/r/322279 (https://phabricator.wikimedia.org/T150257) [12:40:30] PROBLEM - puppet last run on elastic1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:55:41] 06Operations, 10netops: Packet loss from Voxel to text load balancers - https://phabricator.wikimedia.org/T153998#2898929 (10akosiaris) Why did they have to go through the trouble of using TCP ? Maybe some kind of restriction on their home network? Even in that case, the jump from <10ms to ~100ms in the very l... 
[13:08:30] RECOVERY - puppet last run on elastic1032 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures [13:08:48] (03CR) 10Alexandros Kosiaris: [C: 031] Make systemd-timesyncd available as an alternative time synchronisation provider (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/322279 (https://phabricator.wikimedia.org/T150257) (owner: 10Muehlenhoff) [13:17:02] (03PS22) 10Alexandros Kosiaris: Add profile::kubernetes::node profile class [puppet] - 10https://gerrit.wikimedia.org/r/324212 [13:17:04] (03PS21) 10Alexandros Kosiaris: Include ::profile::kubernetes::node in role::kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/324213 [13:45:43] 06Operations, 06Discovery, 10Traffic, 06WMDE-Tech-Communication, and 3 others: announce breaking change: http > https for entities in rdf - https://phabricator.wikimedia.org/T154015#2898974 (10ema) [13:59:52] (03PS1) 10Alexandros Kosiaris: Amend k8s_infrastructure_users [labs/private] - 10https://gerrit.wikimedia.org/r/328905 [14:01:35] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] Amend k8s_infrastructure_users [labs/private] - 10https://gerrit.wikimedia.org/r/328905 (owner: 10Alexandros Kosiaris) [14:09:00] PROBLEM - puppet last run on multatuli is CRITICAL: CRITICAL: Puppet last ran 1 day ago [14:09:03] (03CR) 10Alexandros Kosiaris: [C: 032] Add profile::kubernetes::node profile class [puppet] - 10https://gerrit.wikimedia.org/r/324212 (owner: 10Alexandros Kosiaris) [14:09:11] (03CR) 10Alexandros Kosiaris: [C: 032] Include ::profile::kubernetes::node in role::kubernetes::worker [puppet] - 10https://gerrit.wikimedia.org/r/324213 (owner: 10Alexandros Kosiaris) [14:12:00] RECOVERY - puppet last run on multatuli is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures [14:12:01] (03PS1) 10Urbanecm: Add HD logos for eswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328908 (https://phabricator.wikimedia.org/T150618) [14:13:40] PROBLEM - Check 
systemd state on kubernetes1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:13:40] PROBLEM - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[kubernetes-node],File[/etc/kubernetes/ssl] [14:14:10] PROBLEM - puppet last run on kubernetes1002 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[kubernetes-node],File[/etc/kubernetes/ssl] [14:14:30] PROBLEM - Check systemd state on kubernetes1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:16:30] PROBLEM - Check systemd state on copper is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:21:30] RECOVERY - Check systemd state on copper is OK: OK - running: The system is fully operational [14:33:38] (03PS2) 10Ema: WIP: varnishreqstats: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/328688 (https://phabricator.wikimedia.org/T151643) [14:34:31] (03CR) 10jerkins-bot: [V: 04-1] WIP: varnishreqstats: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/328688 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [14:36:12] (03PS3) 10Ema: WIP: varnishreqstats: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/328688 (https://phabricator.wikimedia.org/T151643) [14:37:40] PROBLEM - Check systemd state on kubernetes1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:37:40] PROBLEM - puppet last run on kubernetes1004 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. 
Failed resources (up to 3 shown): Package[kubernetes-node],File[/etc/kubernetes/ssl] [14:39:23] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring, 15User-Elukey: Port apache httpd metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T147316#2899029 (10elukey) Created https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats as first draft for apache/... [14:39:54] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring, 15User-Joe: Port HHVM metrics from ganglia to prometheus - https://phabricator.wikimedia.org/T147423#2899030 (10elukey) Created a dashboard as written in https://phabricator.wikimedia.org/T147316#2899029 [14:41:40] PROBLEM - Check systemd state on kubernetes1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [14:41:40] PROBLEM - puppet last run on kubernetes1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[kubernetes-node],File[/etc/kubernetes/ssl] [14:44:40] PROBLEM - puppet last run on restbase1011 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:54:50] (03PS1) 10Elukey: Add the HHVM and Apache videoscaler clusters to Prometheus polling [puppet] - 10https://gerrit.wikimedia.org/r/328913 (https://phabricator.wikimedia.org/T147316) [14:56:37] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/328913 (https://phabricator.wikimedia.org/T147316) (owner: 10Elukey) [14:58:59] wrong ports sigh [14:59:13] elukey: they are not the same of the others? 
[14:59:54] I am going to do a failover test on labsdb1009 (labsdb-analytics) [14:59:58] yes yes, I just realized it, so it is pcc behaving strangely [15:00:07] it should give an alert here [15:00:30] volans: https://puppet-compiler.wmflabs.org/4991/prometheus1001.eqiad.wmnet/change.prometheus1001.eqiad.wmnet.err [15:00:53] ah it says "Detail: getaddrinfo: Name or service not known" [15:01:09] ok so pcc is not useful in here [15:01:20] cluster_config.pp:44 on node prometheus1001.eqiad.wmnet [15:01:21] Error: Failed to parse template prometheus/cluster_config.erb: [15:02:51] yes but the error message is not so clear and there is a getaddrinfo error [15:03:05] I'll ask Filippo, the videoscalers can wait :D [15:12:40] RECOVERY - puppet last run on restbase1011 is OK: OK: Puppet is currently enabled, last run 39 seconds ago with 0 failures [15:13:37] !log testing labsdb-analytics automatic failover [15:13:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:34] it took 9 seconds to switchover, not sure if that is ok, with the traffic those machines have [15:18:00] what I am not getting is an alert [15:18:50] PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1 [15:19:08] ha, there it is [15:19:40] that one, I should definitely make lower, as it is very unlikely to have false positives or timeouts [15:20:18] recovering now [15:21:50] RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover servers up 2 down 0 [15:29:50] PROBLEM - puppet last run on db1043 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [15:39:50] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. 
Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [15:44:39] 06Operations, 06Analytics-Kanban, 15User-Elukey: Yarn node manager JVM memory leaks - https://phabricator.wikimedia.org/T153951#2899094 (10elukey) https://issues.apache.org/jira/browse/HADOOP-13362 (https://issues.apache.org/jira/browse/YARN-5482 has a more simple description and points to it) seems to be a... [15:57:50] RECOVERY - puppet last run on db1043 is OK: OK: Puppet is currently enabled, last run 48 seconds ago with 0 failures [16:02:43] (03PS3) 10Ema: varnishxcache: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/328179 (https://phabricator.wikimedia.org/T151643) [16:03:11] (03PS4) 10Ema: varnishxcache: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/328179 (https://phabricator.wikimedia.org/T151643) [16:06:50] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [16:07:22] (03PS5) 10Ema: varnishxcache: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/328179 (https://phabricator.wikimedia.org/T151643) [16:07:32] (03CR) 10Ema: [V: 032 C: 032] varnishxcache: port to cachestats.CacheStatsSender [puppet] - 10https://gerrit.wikimedia.org/r/328179 (https://phabricator.wikimedia.org/T151643) (owner: 10Ema) [16:18:12] (03PS1) 10Ema: Revert "varnishxcache: port to cachestats.CacheStatsSender" [puppet] - 10https://gerrit.wikimedia.org/r/328919 [16:18:58] (03PS2) 10Ema: Revert "varnishxcache: port to cachestats.CacheStatsSender" [puppet] - 10https://gerrit.wikimedia.org/r/328919 [16:20:07] (03CR) 10Ema: [V: 032 C: 032] Revert "varnishxcache: port to cachestats.CacheStatsSender" [puppet] - 10https://gerrit.wikimedia.org/r/328919 (owner: 10Ema) [16:28:10] PROBLEM - DPKG on db2060 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
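The `haproxy failover ... check_failover servers up 1 down 1` alert during jynus's labsdb failover test earlier presumably counts backend server states from haproxy's stats interface. A minimal sketch against the CSV that haproxy's `show stat` socket command emits; the sample CSV, proxy name, and server names here are fabricated:

```python
import csv
import io

def count_backend_states(stats_csv, proxy):
    # Count UP/DOWN backend servers for one proxy from haproxy 'show stat'
    # CSV output. The header line starts with '# '; the FRONTEND/BACKEND
    # summary rows are skipped. startswith() handles states like "UP 1/3".
    reader = csv.DictReader(io.StringIO(stats_csv.lstrip("# ")))
    up = down = 0
    for row in reader:
        if row["pxname"] == proxy and row["svname"] not in ("FRONTEND", "BACKEND"):
            if row["status"].startswith("UP"):
                up += 1
            elif row["status"].startswith("DOWN"):
                down += 1
    return up, down

SAMPLE = """\
# pxname,svname,status,weight
mariadb,labsdb1009,DOWN,1
mariadb,labsdb1011,UP,1
mariadb,BACKEND,UP,2
"""

print(count_backend_states(SAMPLE, "mariadb"))  # → (1, 1)
```

With one server deliberately failed over, such a count yields "up 1 down 1" while the test is in progress and "up 2 down 0" once the backend recovers, matching the alert and recovery seen above.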
[16:28:15] PROBLEM - MariaDB disk space on db2060 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:28:15] PROBLEM - configured eth on db2060 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:28:33] ^ paged ^ [16:28:40] that is not me [16:28:45] did it crash? [16:29:09] possibly, not responsive to ssh, trying mgmt [16:29:14] either that or network [16:29:20] doing that [16:29:30] PROBLEM - Host db2060 is DOWN: PING CRITICAL - Packet loss = 100% [16:29:33] yeah [16:29:36] on mgmt console [16:29:40] its rebooting... [16:29:44] mmm [16:29:49] oh, not reboot, scrolling fast [16:29:55] kernel? [16:29:59] messages [16:30:29] i just pasted in wrong channel by accident [16:30:38] [27858755.813147] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [16:30:38] [27858755.853801] INFO: task xfsaild/dm-0:617 blocked for more than 120 seconds. [16:30:38] [27858755.889060] Tainted: G W 3.19.0-2-amd64 #1 [16:30:47] xfs, maybe RAID controller failure [16:31:02] [27858757.129438] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 7, t=36768 jiffies, g=237869017, c=237869016, q=391) [16:31:02] [27858820.221410] INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 15, t=52524 jiffies, g=237869017, c=237869016, q=2300) [16:31:12] so yeah lots of the first message and then a few of the latter [16:31:17] just scrolling on console [16:31:32] * apergos looks in [16:31:34] its also not responsive [16:31:35] (page) [16:31:42] so its scrolling that along the console, but wont take input [16:31:47] it is not a critical slave, so please just powercycle it [16:31:52] doing now [16:31:56] and I will take it from there [16:32:18] should I disable the alert so it doesn't page again or better leave it? [16:32:34] so people that is not connected knows it has been handled? [16:32:34] I do not mind the pages [16:32:40] can't speak for others [16:32:44] the recovery one [16:32:51] once the first has gone off? 
[16:32:57] !log rebooted db2060 due to some kind of hw raid or xfs failure, wouldnt take input [16:32:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:13] jynus: ok, i'm just watching it post [16:33:15] db2060 is one of the newest ones [16:33:16] or waiting to ;] [16:33:22] which is a pity [16:33:44] last batch of codfw [16:33:51] almost everyone is in a tz where the page will not interrupt sleep (sf now awake, eu's not sleeping for several hours yet) [16:34:15] the majority of ops dont get paged 24x7 [16:34:20] 06Operations: Lost session data on every save attempt - https://phabricator.wikimedia.org/T153984#2899128 (10Aklapper) [16:34:24] a couple of folks do cuz they are gluttons for punishment [16:34:54] but typically a page shouldnt be anything terrible, the only issue is pager fatigue if it is false alerts. [16:35:03] db2060 still posting, hps are slow [16:35:10] andddd [16:35:13] 1719-Slot 0 Drive Array - A controller failure event occurred prior to this [16:35:13] power-up. (Previous lock up code = 0x13) [16:35:15] jynus: yep [16:35:19] raid controller failure [16:35:30] boooo [16:35:35] that showed during post but scrolled past [16:35:44] but is it booting (temporary) [16:35:45] likely wont boot right, its trying now [16:35:51] we shall see! [16:35:52] ok, waiting [16:36:00] lately [16:36:04] getting failed sata links [16:36:08] we have very infrequent [16:36:15] ok, its up [16:36:22] you should be able to ssh in and investigate as needed [16:36:22] just a few seconds RAID disconnection [16:36:50] RECOVERY - Host db2060 is UP: PING OK - Packet loss = 0%, RTA = 36.24 ms [16:37:02] and thats enough to send it toppling over since it'll never catch back up from the fail state?
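The "blocked for more than 120 seconds" messages pasted above come from the kernel's hung-task watchdog, which is tunable via sysctl; a minimal illustrative fragment of the relevant knobs (values shown are the usual kernel defaults, not necessarily what these hosts run):

```ini
# /etc/sysctl.d/hung-task.conf -- illustrative sketch, defaults shown
# Seconds a task may sit in uninterruptible (D) sleep before the
# "blocked for more than 120 seconds" warning fires; setting it to 0
# disables the check, which is what the quoted
# "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" hint does.
kernel.hung_task_timeout_secs = 120
# Panic instead of only warning when a hung task is detected (0 = warn only).
kernel.hung_task_panic = 0
```

Note the watchdog only reports the symptom; in this incident the root cause was the RAID controller locking up underneath xfs.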
[16:37:05] RECOVERY - MariaDB disk space on db2060 is OK: DISK OK [16:37:05] RECOVERY - DPKG on db2060 is OK: All packages OK [16:37:05] RECOVERY - configured eth on db2060 is OK: OK - interfaces up [16:37:08] seems like it [16:37:52] I think I have all alerts on codfw non-critical except disk [16:38:07] not 100% sure I should make it non-critical [16:38:31] unlike the others, that is non-recoverable, but it jumps before the host is declared dead [16:38:56] PROBLEM - mysqld processes on db2060 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [16:38:56] PROBLEM - MariaDB Slave SQL: s6 on db2060 is CRITICAL: CRITICAL slave_sql_state could not connect [16:38:56] PROBLEM - MariaDB Slave IO: s6 on db2060 is CRITICAL: CRITICAL slave_io_state could not connect [16:41:27] 06Operations, 10ops-codfw, 10DBA: db2060 crashed, probably RAID controller - https://phabricator.wikimedia.org/T154031#2899146 (10jcrespo) [16:47:19] the new start seems clean at least [16:49:26] 06Operations, 10ops-codfw, 10DBA: db2060 crashed, probably RAID controller - https://phabricator.wikimedia.org/T154031#2899146 (10RobH) When I first connected to the serial console, it wasn't accepting input, but scrolled the following: [27858755.642012] INFO: task jbd2/sda1-8:385 blocked for more than 120... [16:50:54] so the serial output was there, but the error on post for the disk failure was overwritten so i only have what i pasted in here, heh [16:51:27] 06Operations, 10ops-codfw, 10DBA: db2060 crashed, probably RAID controller - https://phabricator.wikimedia.org/T154031#2899165 (10RobH) On post, it also scrolled past: 1719-Slot 0 Drive Array - A controller failure event occurred prior to this power-up. (Previous lock up code = 0x13) [17:10:23] Hello. When I saw a popup from wikipedia that it need donation. I asked myself that a question how can i help wiki. It's not about donating few dollars to it. But I'm a Computer Science student .
I can make myself useful to the Wikipedia a lot so that it can help the world by spreading the knowledge. So i've decided to contribute. But i'm having a hard time understanding few things about which one to choose and what [17:10:23] are skills that i need to contribute(I'm willing to learn). and I've a confusion with Operations and Release engineering Teams . Please help me thanks [17:13:12] kris1: Operations team conducts "operations" on the cluster of servers running the projects. Release engineering takes care of stuff like release versions of mediawiki, quality assurance (testing basically) and so on [17:13:33] 06Operations, 10ops-codfw, 10DBA: db2060 crashed, probably RAID controller - https://phabricator.wikimedia.org/T154031#2899241 (10jcrespo) From the OS logs: ``` Dec 23 16:36:10 db2060 kernel: [ 6.793108] ata2.01: failed to resume link (SControl 0) Dec 23 16:36:10 db2060 kernel: [ 6.829120] ata2.00: SA... [17:19:37] kris1: 'releng' is responsible for deploying updates to MediaWiki, cutting branches and packaging up and publishing releases for the public. [17:19:56] ah I see akosiaris already answered, my bad [17:23:02] 'operations' is, well, administration of the servers running MediaWiki. example: dealing with hardware issues. network infrastructure. deploying and packaging or maintaining services that aren't MediaWiki (etherpad, bots, irc servers, hhvm, varnish caches, memcached...) [17:24:15] kris1: https://www.mediawiki.org/wiki/How_to_become_a_MediaWiki_hacker is supposed to help (and yes, this is the incorrect channel; #wikimedia-tech would be correct) [17:24:17] apergos Thank you so much for the information.
[17:25:07] Nemo_bis thank you :) [17:25:18] if you want to work on MediaWiki itself as a volunteer, see what Nemo said, and that's a good channel to hang out in regardless [17:25:27] here we generally deal with system admin matters [17:25:50] you'll see a bunch of the same faces in there too ;-) [17:36:58] (03PS1) 10Alexandros Kosiaris: profile::kubernetes::node: Set certs owned by root [puppet] - 10https://gerrit.wikimedia.org/r/328924 [17:37:35] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] profile::kubernetes::node: Set certs owned by root [puppet] - 10https://gerrit.wikimedia.org/r/328924 (owner: 10Alexandros Kosiaris) [17:40:10] RECOVERY - puppet last run on kubernetes1003 is OK: OK: Puppet is currently enabled, last run 22 seconds ago with 0 failures [17:40:10] RECOVERY - puppet last run on kubernetes1004 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [17:40:10] RECOVERY - puppet last run on kubernetes1001 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [17:40:11] RECOVERY - puppet last run on kubernetes1002 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [17:41:21] (03PS1) 10Alexandros Kosiaris: Use packages for kube-proxy in production [puppet] - 10https://gerrit.wikimedia.org/r/328926 [17:41:48] 06Operations, 10ops-codfw, 10DBA: db2060 crashed, probably RAID controller - https://phabricator.wikimedia.org/T154031#2899288 (10jcrespo) ``` HP ProLiant System ROM 08/02/2014 HP ProLiant System ROM - Backup 08/02/2014 HP ProLiant System ROM Bootblock 03/05/2013 HP Smart Array P420i Controller 6.00 iLO 2.03... 
[17:42:13] 06Operations, 10ops-codfw, 10DBA: db2060 crashed (RAID controller) - https://phabricator.wikimedia.org/T154031#2899289 (10jcrespo) [17:45:04] (03CR) 10Alexandros Kosiaris: [C: 032] Use packages for kube-proxy in production [puppet] - 10https://gerrit.wikimedia.org/r/328926 (owner: 10Alexandros Kosiaris) [17:45:36] 06Operations, 10ops-codfw, 10DBA: db2060 crashed (RAID controller) - https://phabricator.wikimedia.org/T154031#2899292 (10jcrespo) p:05Triage>03Low Leaving this open for @Marostegui and @Papaul to see, there is not much else to do except maybe "upgrading the bios" so that next time it happens that cannot... [17:45:52] 06Operations, 10ops-codfw, 10DBA: db2060 crashed (RAID controller) - https://phabricator.wikimedia.org/T154031#2899295 (10jcrespo) a:03jcrespo [17:47:57] (03PS1) 10Alexandros Kosiaris: Use the correct require function in in k8s::proxy [puppet] - 10https://gerrit.wikimedia.org/r/328928 [17:48:10] PROBLEM - puppet last run on kubernetes1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [17:49:11] (03CR) 10Alexandros Kosiaris: [C: 032] Use the correct require function in in k8s::proxy [puppet] - 10https://gerrit.wikimedia.org/r/328928 (owner: 10Alexandros Kosiaris) [17:49:40] PROBLEM - puppet last run on labstore1001 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [17:51:10] RECOVERY - puppet last run on kubernetes1001 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [17:54:30] RECOVERY - Check systemd state on kubernetes1002 is OK: OK - running: The system is fully operational [17:54:40] RECOVERY - Check systemd state on kubernetes1003 is OK: OK - running: The system is fully operational [17:54:40] RECOVERY - Check systemd state on kubernetes1001 is OK: OK - running: The system is fully operational [17:54:40] RECOVERY - Check systemd state on kubernetes1004 is OK: OK - running: The system is fully operational [17:59:59] FYI, despite the doubling of the number of videoscalers, and despite most of the v2c servers being put offline to slow down uploads, the Commons transcoding backlog is ‘still’ going up. [18:00:10] PROBLEM - check_raid on barium is CRITICAL: CRITICAL: MegaSAS 2 logical, 4 physical: a0/v1 (2 disk array) degraded [18:01:29] Revent, has the rate slowed? [18:01:37] It seems to, yes. [18:02:07] I think it’s just that, well, people keep uploading huge stuff. [18:02:34] by any significant amount? [18:03:15] I have not kept specific track, but the backlog seems to be going up ‘much’ slower than it was. [18:03:54] And failed transcodes are not increasing by much at all. [18:05:10] PROBLEM - check_raid on barium is CRITICAL: CRITICAL: MegaSAS 2 logical, 4 physical: a0/v1 (2 disk array) degraded [18:05:31] Krenair: The problem is simply lots of stuff like https://commons.wikimedia.org/wiki/File:Investitur_der_neuen_Ritter_Ordenskonvent_in_Laibach_2016.webm [18:06:25] Revent, are we talking about a rate which can be left over the holidays until someone is available? [18:06:38] it's very nearly christmas eve, a lot of ops have already signed off [18:06:51] Yeah, I don’t think it’s major current drama. 
[18:07:15] ok [18:07:41] (I in fact hope that the holidays will see uploads slow enough that the servers have a chance to catch up some) [18:07:49] yeah [18:08:31] Like I said, they took 2/3s of the v2c machines out of the pool yesterday to make those uploads take longer. [18:09:29] 06Operations, 06Labs, 13Patch-For-Review: Kill the labtest $realm - https://phabricator.wikimedia.org/T148717#2899304 (10AlexMonk-WMF) I still see `modules/standard/templates/mail/exim4.minimal.labtest.erb` and `modules/puppetmaster/files/labtest.hiera.yaml` in the puppet repository [18:09:30] But from what has been said, v2c (which does ‘one’ multi-pass transcode) has more processing power than the video scalers (that do as many as 10 or 12) [18:10:10] PROBLEM - check_raid on barium is CRITICAL: CRITICAL: MegaSAS 2 logical, 4 physical: a0/v1 (2 disk array) degraded [18:11:00] ^ that'll be frack [18:15:10] PROBLEM - check_raid on barium is CRITICAL: CRITICAL: MegaSAS 2 logical, 4 physical: a0/v1 (2 disk array) degraded [18:16:50] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [18:17:50] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:17:50] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [18:17:50] RECOVERY - puppet last run on labstore1001 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [18:20:10] PROBLEM - check_raid on barium is CRITICAL: CRITICAL: MegaSAS 2 logical, 4 physical: a0/v1 (2 disk array) degraded [18:21:18] Krenair: the v2c interface shows that it ‘currently’ is processing 26 large videos…. [18:22:11] (which is about the same as the video scalers) but that’s with 2/3 of their pool turned off. 
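The backlog reasoning above (doubled videoscalers, most v2c servers pulled offline to throttle uploads, yet the queue still growing, just more slowly) is the usual queueing arithmetic: the backlog rises whenever work arrives faster than it is served. A toy sketch with entirely hypothetical rates, not measured values from this incident:

```python
# Toy model of the Commons transcode backlog discussed above.
# All numbers are made up for illustration.

def backlog_after(hours, start, arrival_per_h, service_per_h):
    """Backlog grows whenever jobs arrive faster than they are served."""
    return max(0, start + hours * (arrival_per_h - service_per_h))

# With service capacity well below the arrival rate, the queue climbs fast...
before = backlog_after(24, start=1000, arrival_per_h=60, service_per_h=40)
# ...added capacity slows the climb, but the backlog still rises as long
# as arrivals exceed service -- matching what Revent observed.
after = backlog_after(24, start=1000, arrival_per_h=60, service_per_h=55)
```

Only when the service rate exceeds the arrival rate (e.g. uploads slowing over the holidays) can the queue actually drain.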
[18:24:50] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [18:25:10] PROBLEM - check_raid on barium is CRITICAL: CRITICAL: MegaSAS 2 logical, 4 physical: a0/v1 (2 disk array) degraded [18:26:50] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:27:37] 06Operations, 10Ops-Access-Requests, 10Gerrit, 13Patch-For-Review: Root for Mukunda for Gerrit machine(s) - https://phabricator.wikimedia.org/T152236#2842574 (10Dzahn) I confirmed the access on cobalt. ``` [cobalt:~] $ id twentyafterfour uid=4967(twentyafterfour) gid=500(wikidev) groups=500(wikidev),703(... [18:27:50] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [18:27:52] 06Operations, 10Ops-Access-Requests, 10Gerrit, 13Patch-For-Review: Root for Mukunda for Gerrit machine(s) - https://phabricator.wikimedia.org/T152236#2899343 (10Dzahn) 05Open>03Resolved a:03Dzahn [18:27:56] (03PS2) 10Urbanecm: Add HD logos for multiple wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328908 (https://phabricator.wikimedia.org/T150618) [18:30:10] PROBLEM - check_raid on barium is CRITICAL: CRITICAL: MegaSAS 2 logical, 4 physical: a0/v1 (2 disk array) degraded [18:33:38] (03PS1) 10Madhuvishy: labsdb: Fix wiki url construction in maintain_meta-p [puppet] - 10https://gerrit.wikimedia.org/r/328929 (https://phabricator.wikimedia.org/T153987) [18:35:10] PROBLEM - check_raid on barium is CRITICAL: CRITICAL: MegaSAS 2 logical, 4 physical: a0/v1 (2 disk array) degraded [18:37:19] 06Operations, 10Ops-Access-Requests, 10Gerrit: Root for Mukunda for Gerrit machine(s) - https://phabricator.wikimedia.org/T152236#2899355 (10Dzahn) [18:40:10] PROBLEM - check_raid on barium is CRITICAL: CRITICAL: MegaSAS 2 logical, 4 physical: a0/v1 (2 disk array) degraded [18:44:50] PROBLEM - puppet last run on elastic1034 is CRITICAL: CRITICAL: Catalog fetch 
fail. Either compilation failed or puppetmaster has issues [18:45:10] PROBLEM - check_raid on barium is CRITICAL: CRITICAL: MegaSAS 2 logical, 4 physical: a0/v1 (2 disk array) degraded [18:47:06] (03CR) 10Madhuvishy: [C: 032] labsdb: Fix wiki url construction in maintain_meta-p [puppet] - 10https://gerrit.wikimedia.org/r/328929 (https://phabricator.wikimedia.org/T153987) (owner: 10Madhuvishy) [18:50:10] PROBLEM - check_raid on barium is CRITICAL: CRITICAL: MegaSAS 2 logical, 4 physical: a0/v1 (2 disk array) degraded [18:50:37] opening a task and silencing that [18:52:35] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Degraded raid on barium - https://phabricator.wikimedia.org/T154039#2899402 (10fgiunchedi) [18:54:43] (03PS3) 10Urbanecm: Add HD logos for multiple wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328908 (https://phabricator.wikimedia.org/T150618) [19:07:32] (03CR) 10Filippo Giunchedi: "> if we have a general consensus on the shell script extension" [puppet] - 10https://gerrit.wikimedia.org/r/327592 (https://phabricator.wikimedia.org/T148494) (owner: 10Dzahn) [19:10:50] PROBLEM - puppet last run on labsdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:11:45] 06Operations, 10ops-eqiad: update label/racktables visible label for thumbor100[12] - https://phabricator.wikimedia.org/T153965#2899456 (10RobH) [19:12:50] RECOVERY - puppet last run on elastic1034 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures [19:26:07] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Degraded raid on barium - https://phabricator.wikimedia.org/T154039#2899402 (10Cmjohnson) This will required downtime to replace the disk. There are 4 internal 3.5" disks. I can do this next week if it works for @jeff_green [19:33:08] 06Operations: spare/unused disks on application servers - https://phabricator.wikimedia.org/T106381#2899484 (10fgiunchedi) thanks @Dzahn ! 
I noticed there were false positives in the list (mwlog), here's a more accurate command that weeds out those ``` neodymium:~$ sudo salt -l quiet -t 4 --output=txt 'mw*' cm... [19:36:40] (03CR) 10Filippo Giunchedi: [C: 031] Add the HHVM and Apache videoscaler clusters to Prometheus polling [puppet] - 10https://gerrit.wikimedia.org/r/328913 (https://phabricator.wikimedia.org/T147316) (owner: 10Elukey) [19:38:45] !log install libxml2 upgrade on "misc-others" [19:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:50] RECOVERY - puppet last run on labsdb1006 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [19:42:25] (03CR) 10Filippo Giunchedi: [C: 031] Make systemd-timesyncd available as an alternative time synchronisation provider [puppet] - 10https://gerrit.wikimedia.org/r/322279 (https://phabricator.wikimedia.org/T150257) (owner: 10Muehlenhoff) [19:45:40] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: Degraded raid on barium - https://phabricator.wikimedia.org/T154039#2899521 (10Jgreen) Barium has two RAID1 partitions, the disk that failed belongs to the OS /archive partition which is used to buffer logs and jenkins output for sync offhost for long-term a... 
[19:46:55] !log install libxml2 upgrade on bastions and mw-canary [19:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:49] (03PS1) 10Madhuvishy: labsdb: Fix maintain-meta_p to insert correct url into wiki db [puppet] - 10https://gerrit.wikimedia.org/r/328931 (https://phabricator.wikimedia.org/T153987) [19:53:26] (03CR) 10Alex Monk: [C: 031] labsdb: Fix maintain-meta_p to insert correct url into wiki db [puppet] - 10https://gerrit.wikimedia.org/r/328931 (https://phabricator.wikimedia.org/T153987) (owner: 10Madhuvishy) [19:54:25] (03CR) 10Madhuvishy: [C: 032] labsdb: Fix maintain-meta_p to insert correct url into wiki db [puppet] - 10https://gerrit.wikimedia.org/r/328931 (https://phabricator.wikimedia.org/T153987) (owner: 10Madhuvishy) [19:55:36] !log cobalt - restarted apache [19:55:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:56:10] 06Operations, 10IDS-extension, 10Wikimedia-Extension-setup, 07I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#2899541 (10kaldari) [19:58:06] 06Operations, 10IDS-extension, 10Wikimedia-Extension-setup, 07I18n: Deploy IDS rendering engine to production - https://phabricator.wikimedia.org/T148693#2730181 (10kaldari) [19:59:18] 06Operations, 06Labs: Explore hosting the multimedia commons use case - https://phabricator.wikimedia.org/T152632#2899554 (10DarTar) I have concerns about hosting this data on Commons. These concerns are not technical (i.e. storage or license related) but community-related. Large-scale data imports have major... [20:04:24] 06Operations: reinstall rdb100[56] with RAID - https://phabricator.wikimedia.org/T140442#2899560 (10Dzahn) Ah! ok, but do we have RAID with mw-no-tmp.cfg ? or should we use "raid1.cfg" because rdb100[7-8] use that? As opposed to rdb100[1-6] using mw.cfg. 
[20:40:17] (03PS1) 10Dzahn: base: add lshw to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/328952 [20:41:15] (03CR) 10jerkins-bot: [V: 04-1] base: add lshw to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/328952 (owner: 10Dzahn) [20:41:48] (03CR) 10RobH: [C: 031] "This makes comparing hardware across the fleet a lot easier!" [puppet] - 10https://gerrit.wikimedia.org/r/328952 (owner: 10Dzahn) [20:41:50] (03PS2) 10Dzahn: base: add lshw to standard packages [puppet] - 10https://gerrit.wikimedia.org/r/328952 [21:29:34] !log restarted apache on mw canary servers [21:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:10] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [21:48:41] 06Operations: reinstall rdb100[56] with RAID - https://phabricator.wikimedia.org/T140442#2899772 (10Dzahn) I looked for rdb1001 in racktables and as Rob points out there you can find the linked RT ticket where we can see what they were ordered as. so i checked them all and: | host | RT ticket | | rdb1001 | 4... [21:49:46] 06Operations, 10ops-codfw, 10DBA: db2060 crashed (RAID controller) - https://phabricator.wikimedia.org/T154031#2899774 (10Marostegui) I haven't checked in much detail, but from the logs it looks like just a controller crash indeed. We can upgrade the BIOS once we have some spare time now that it is easy to d... [21:50:30] 06Operations: reinstall rdb100[56] with RAID - https://phabricator.wikimedia.org/T140442#2899775 (10elukey) >>! In T140442#2899560, @Dzahn wrote: > Ah! ok, but do we have RAID with mw-no-tmp.cfg ? or should we use "raid1.cfg" because rdb100[7-8] use that? As opposed to rdb100[1-6] using mw.cfg. Sorry I didn't g... 
[21:53:02] 06Operations: reinstall rdb100[56] with RAID - https://phabricator.wikimedia.org/T140442#2899779 (10Dzahn) 4281 - " 6 high performance misc servers, with SSDs in addition to the normal disks." 4712 - "2 x 500GB HDD" 527 - "large scale purchase for eqiad buildout" hmm.. so missing the RT ticket for 1005/1006... [22:00:10] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [22:07:35] Anyone able to look into a database error i got when trying to create an acount or does it want to go through phab [22:10:37] mutante ^^ [22:15:05] Amortias hi, could you do a phab task please [22:15:14] Hi. [22:15:18] Not many ops around as it is the holiday. [22:15:21] paladox - doing [22:15:25] thanks. [22:15:27] Amortias: you try to create an account on a Wikimedia project? [22:15:45] EN, requet logged through the CC interface [22:15:59] Did you receive an alphanumeric error message? [22:16:03] generates an error of A database query error has occurred. This may indicate a bug in the software.[WF2ejApAAEMAAhqIOOoAAAAG] 2016-12-23 22:00:46: Fatal exception of type "DBQueryError" [22:16:06] ah thanks [22:16:18] I'm looking this code in our logs. [22:16:53] want me to assign the phab ticket im about to create to you or hold off for the moment on logign one [22:17:10] It's well a database error. [22:17:59] The global SUL account has been created, but not the en.wikipedia one. [22:18:19] bblack: ema: yt? [22:19:28] Amortias: could you suggest to the user to try to login with the account you created? There are some chances, as the SUL account has been created, it will retry to create the account locally on en. 
(like it would do on any other Wikimedia project) [22:20:22] Would have to send them an e-mail to that effect but can do so, wasnt showing in the logs of my created accounts on en so wasnt sure if it had gone through or not [22:20:47] Amortias: a pure deadlock database error has unfortunately occurred at the moment it tried to save some user preferences, so all the transaction has been rolled back. [22:21:32] Amortias: I hadn't met this case before, so I can't be 100% confident, but I'm pretty sure it was a one shot, not reproducible issue, and next retry should work. [22:21:58] Our logs show the SUL account has successfully been created. [22:22:28] cheers, wont bother with the ticket [22:23:09] 06Operations: reinstall rdb100[56] with RAID - https://phabricator.wikimedia.org/T140442#2899875 (10Dzahn) installed "lshw" on rdb1005/1006. It shows we have an unused "sdb" here that has an NTFS partition on both. so yea, 2 HDDs here with 500GB each. rdb1001 is the same but both drives have Linux partitions.... [22:23:41] Amortias: but https://tools.wmflabs.org/quentinv57-tools/tools/sulinfo.php?username=RM+Moabi doesn't show it [22:24:04] I don't know what queries quentinv57-tools does [22:24:40] its the one we use on the ACC interface to check so that suggests its pretty thorough [22:28:34] It doesn't appear on the CentralAuth database either. Sorry, the log message only meant it *was* created successfully, but the rollback seems to include the global account too, to avoid a situation where a global account exists, but not a local one. [22:28:44] Good news is you can retry to recreate it. [22:28:59] We'll see if this time it's okay, or if there is still a deadlock. [22:29:54] Dereckson - looks like its gone through this time [22:30:01] from my end anyway [22:30:15] It's also in the CentralAuth database [22:30:42] And it well appears on sulinfo: https://tools.wmflabs.org/quentinv57-tools/tools/sulinfo.php?username=RM+Moabi [22:31:54] And no error in the logs this time.
All looks good. [22:31:58] cheers for your time [22:33:40] Now perhaps we should create a task "Provide instructions when a database error occurs when creating an account". [22:36:50] PROBLEM - puppet last run on ms-be1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:40:10] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server] [22:44:39] bblack: ema: gotta run... my question was, who would be a contact for Maxmind? We found a pattern of lower impression rate for subregions that Maxmind labels as "Unknown". https://phabricator.wikimedia.org/T152650#2899891 [22:45:17] If u have any comments, pls don't hesitate to ping me (I should get all backscroll) or comment on the task... Many thanks in advance!! [22:49:54] AndyRussG: maybe this https://support.maxmind.com/geoip-data-correction-request/ [22:52:20] mutante: hey! thanks!! mm now really gotta run!! ;p [23:05:50] RECOVERY - puppet last run on ms-be1023 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [23:07:10] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [23:40:10] PROBLEM - puppet last run on cp3020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
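The account-creation incident above follows a classic pattern: a transient deadlock rolls back the whole transaction (including, here, the CentralAuth global account), and a simple retry of the entire operation succeeds. A minimal sketch of that retry-the-whole-transaction pattern; the `DBQueryError` class and `create_account` helper are hypothetical stand-ins, not MediaWiki's actual API:

```python
import time

class DBQueryError(Exception):
    """Stand-in for a transient database error such as a deadlock."""

def run_with_retry(txn, attempts=3, delay=0.0):
    """Run a transactional callable, retrying on transient DB errors.

    Each failed attempt is assumed to have been fully rolled back,
    so a retry simply re-runs the whole transaction from scratch.
    """
    for attempt in range(1, attempts + 1):
        try:
            return txn()
        except DBQueryError:
            if attempt == attempts:
                raise          # give up after the last attempt
            time.sleep(delay)  # brief back-off before retrying

# Simulate the incident: the first attempt deadlocks, the retry succeeds.
state = {"calls": 0}

def create_account():
    state["calls"] += 1
    if state["calls"] == 1:
        raise DBQueryError("Deadlock found when trying to get lock")
    return "account created"

result = run_with_retry(create_account)
```

The key design point, matching Dereckson's explanation, is that a rollback must be all-or-nothing: leaving a global account without its local counterpart would be worse than failing cleanly and retrying.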