[00:00:35] 06Operations, 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: decom barium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T162952#3188450 (10Dzahn) @Jgreen @Robh @Cmjohnson since the combination of decom and fundraising doesn't happen that often i am unsure about the workflow here. I did check the... [00:02:06] 06Operations, 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: decom barium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T162952#3188455 (10Dzahn) it might also appear in the pfw and other fundraising config in other repos. is that also part of it? [00:05:19] 06Operations: acpi_pad consuming 100% CPU on tin - https://phabricator.wikimedia.org/T163158#3188460 (10Dzahn) also removed the module and blacklisted it on all 16 `R320` servers now. so this should not happen again. see parent task for more details. [00:19:57] 06Operations, 10ops-codfw: Degraded RAID on heze - https://phabricator.wikimedia.org/T163087#3185569 (10Dzahn) since the ops-monitoring-bot created this with an actual "CHECK_NRPE: Socket timeout" i first assumed this is a false positive. after figuring out the command line used by NRPE, i found it DOES show... [00:26:02] 06Operations, 10DBA, 10Traffic: dbtree: make wasat a working backend and become active-active - https://phabricator.wikimedia.org/T163141#3188507 (10Dzahn) not sure if this should be tagged as traffic or not. please feel free to remove it. it just got auto-added because it copies tags when you create somethi... [00:29:52] 06Operations, 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: decom barium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T162952#3188529 (10Jgreen) We probably should have followed a template of some kind... - remove from DNS - disk wipe - physical disposal - update racktables - router config chang... [00:33:31] 06Operations, 10ops-eqiad, 06DC-Ops, 10fundraising-tech-ops: decom barium.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T162952#3188531 (10Dzahn) >>! In T162952#3188529, @Jgreen wrote: > We probably should have followed a template of some kind... @Robh has made [[ https://wikitech.wikimedia.org/wi... [00:36:48] !log catrope@tin Synchronized php-1.29.0-wmf.20/extensions/MobileFrontend/resources/mobile.mainMenu/mainmenu.less: T163059 (duration: 03m 07s) [00:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:36:56] T163059: MobileFrontend hamburger menu broken in RTL in Wikimedia in production - https://phabricator.wikimedia.org/T163059 [00:40:10] jdlrobson: ---^^ is deployed now [00:40:23] (03CR) 10Milimetric: "Oh, I was told this repository auto-deploys via puppet. Are you saying merges to it are not allowed during deployment freezes?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348625 (owner: 10Milimetric) [00:48:07] PROBLEM - salt-minion processes on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
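The acpi_pad fix Dzahn logs at [00:05:19] above (remove the module, blacklist it on the R320s) boils down to a couple of shell commands. A minimal sketch, assuming Debian's modprobe.d layout; the blacklist file name is an arbitrary choice:

```bash
# Unload the module currently burning 100% CPU (no-op if it is already gone).
modprobe -r acpi_pad || true

# Blacklist it so it is never auto-loaded again.
echo 'blacklist acpi_pad' > /etc/modprobe.d/blacklist-acpi_pad.conf

# Rebuild the initramfs so the blacklist also holds early at boot.
update-initramfs -u
```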
[00:48:57] RECOVERY - salt-minion processes on thumbor1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [01:11:48] (03PS3) 10BryanDavis: Update links to Tool Labs apt repository [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/339201 (https://phabricator.wikimedia.org/T158383) (owner: 10Tim Landscheidt) [01:11:50] (03PS3) 10BryanDavis: Refactor apt-get actions in Dockerfiles [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/337880 [01:18:02] (03PS1) 10Dzahn: monitoring: fix wrong parameter bug in file ownership check [puppet] - 10https://gerrit.wikimedia.org/r/348664 [01:19:19] (03CR) 10jerkins-bot: [V: 04-1] monitoring: fix wrong parameter bug in file ownership check [puppet] - 10https://gerrit.wikimedia.org/r/348664 (owner: 10Dzahn) [01:27:34] (03CR) 10BryanDavis: [C: 032] Update links to Tool Labs apt repository [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/339201 (https://phabricator.wikimedia.org/T158383) (owner: 10Tim Landscheidt) [01:27:42] (03CR) 10BryanDavis: [C: 032] Refactor apt-get actions in Dockerfiles [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/337880 (owner: 10BryanDavis) [01:28:02] (03Merged) 10jenkins-bot: Update links to Tool Labs apt repository [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/339201 (https://phabricator.wikimedia.org/T158383) (owner: 10Tim Landscheidt) [01:28:08] (03Merged) 10jenkins-bot: Refactor apt-get actions in Dockerfiles [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/337880 (owner: 10BryanDavis) [01:29:44] (03PS2) 10Dzahn: monitoring: fix wrong parameter bug in file ownership check [puppet] - 10https://gerrit.wikimedia.org/r/348664 [01:30:33] (03PS3) 10Dzahn: monitoring: fix wrong parameter bug in file ownership check [puppet] - 10https://gerrit.wikimedia.org/r/348664 [01:30:59] (03PS4) 10Dzahn: monitoring: fix wrong parameter bug in file ownership check [puppet] - 10https://gerrit.wikimedia.org/r/348664 [01:31:59] (03CR) 10jerkins-bot: [V: 04-1] monitoring: fix wrong parameter bug in file ownership check [puppet] - 10https://gerrit.wikimedia.org/r/348664 (owner: 10Dzahn) [01:37:16] (03PS5) 10Dzahn: monitoring: fix wrong parameter bug in file ownership check [puppet] - 10https://gerrit.wikimedia.org/r/348664 [01:38:01] (03PS1) 10Dzahn: base::service_unit: add symlink from /etc into /var for systemd units [puppet] - 10https://gerrit.wikimedia.org/r/348665 [01:40:38] (03CR) 10Dzahn: "i expected unit files in /etc/systemd (as files or as symlinks into /lib/systemd) but found it confusing that they didn't show up in /etc/" [puppet] - 10https://gerrit.wikimedia.org/r/348665 (owner: 10Dzahn) [01:59:00] (03PS1) 10Dzahn: netboot: fix/adjust partman config for rdb servers [puppet] - 10https://gerrit.wikimedia.org/r/348666 (https://phabricator.wikimedia.org/T140442) [02:01:37] (03PS1) 10Dzahn: monitoring: add timeout parameter to bad_directory_owner check [puppet] - 10https://gerrit.wikimedia.org/r/348667 [02:02:45] (03CR) 10jerkins-bot: [V: 04-1] monitoring: add timeout parameter to bad_directory_owner check [puppet] - 10https://gerrit.wikimedia.org/r/348667 (owner: 10Dzahn) [02:03:51] (03PS2) 10Dzahn: monitoring: add timeout parameter to bad_directory_owner check [puppet] - 10https://gerrit.wikimedia.org/r/348667 [02:04:36] (03PS3) 10Dzahn: monitoring: add timeout parameter to bad_directory_owner check [puppet] - 10https://gerrit.wikimedia.org/r/348667 [02:09:03] (03PS2) 1020after4: Phab: create some 
task types and corresponding custom fields. [puppet] - 10https://gerrit.wikimedia.org/r/345618 (https://phabricator.wikimedia.org/T93499) [02:30:43] 06Operations, 13Patch-For-Review: acpi_pad issues - https://phabricator.wikimedia.org/T162850#3188754 (10BBlack) [02:30:45] 06Operations, 10ops-codfw, 10Traffic: baham (ns1) CPU-related issues - https://phabricator.wikimedia.org/T159870#3188751 (10BBlack) 05Open>03Resolved a:03Dzahn I'll close it for now. If we see more strange issues with super-low cpu freqs we can always search these up to correlate I guess. [03:01:08] (03CR) 10Chad: Move contribution tracking config to CommonSettings.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/342857 (https://phabricator.wikimedia.org/T147479) (owner: 10Chad) [03:12:17] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1083.40 Read Requests/Sec=3607.70 Write Requests/Sec=17.00 KBytes Read/Sec=20199.60 KBytes_Written/Sec=4227.60 [03:22:27] PROBLEM - dhclient process on thumbor1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [03:23:17] RECOVERY - dhclient process on thumbor1002 is OK: PROCS OK: 0 processes with command name dhclient [03:27:17] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1130.60 Read Requests/Sec=174.60 Write Requests/Sec=2.60 KBytes Read/Sec=5855.20 KBytes_Written/Sec=365.20 [03:28:17] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=177.50 Read Requests/Sec=171.70 Write Requests/Sec=2.00 KBytes Read/Sec=4222.40 KBytes_Written/Sec=416.00 [04:04:31] (03PS1) 10EBernhardson: Install ::statistics::packages to stat1004 [puppet] - 10https://gerrit.wikimedia.org/r/348669 (https://phabricator.wikimedia.org/T163177) [04:06:08] 06Operations, 10ops-codfw, 06Performance-Team, 15User-fgiunchedi: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3188836 (10Krinkle) @fgiunchedi So where is the data now? [04:09:17] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=7827.20 Read Requests/Sec=8640.60 Write Requests/Sec=9.50 KBytes Read/Sec=34972.40 KBytes_Written/Sec=249.20 [04:12:17] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=10.90 Read Requests/Sec=5.50 Write Requests/Sec=20.60 KBytes Read/Sec=38.80 KBytes_Written/Sec=594.40 [04:15:31] (03CR) 10EBernhardson: [C: 04-1] "puppet compiler doesn't like this one, complaining that openjdk-7-jdk gets declared twice. 
I poked around and there isn't anything obvious" [puppet] - 10https://gerrit.wikimedia.org/r/348669 (https://phabricator.wikimedia.org/T163177) (owner: 10EBernhardson) [05:58:30] (03CR) 10Marostegui: [C: 031] mariadb: grant user 'phstats' additional select on differential db (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/348565 (owner: 10Dzahn) [06:16:51] !log For the record: restarted s7 instance on db1069 - T163183 [06:16:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:16:59] T163183: db1069 s7 replication thread stuck on fawiki.flaggedrevs_tracking - https://phabricator.wikimedia.org/T163183 [06:23:09] (03PS8) 10Muehlenhoff: Mark wireshark-common/install-setuid as seen to avoid debconf prompt [puppet] - 10https://gerrit.wikimedia.org/r/346162 [07:04:08] (03CR) 10Giuseppe Lavagetto: Allow suppressing SAN warnings from urllib3 (032 comments) [software/conftool] - 10https://gerrit.wikimedia.org/r/347591 (owner: 10Giuseppe Lavagetto) [07:04:27] (03CR) 10Giuseppe Lavagetto: [C: 032] Add a free-form 'any' type [software/conftool] - 10https://gerrit.wikimedia.org/r/347356 (https://phabricator.wikimedia.org/T156924) (owner: 10Giuseppe Lavagetto) [07:04:57] (03CR) 10Giuseppe Lavagetto: "I am pretty sure the logic is ok as far as data structures are concerned." [software/conftool] - 10https://gerrit.wikimedia.org/r/347356 (https://phabricator.wikimedia.org/T156924) (owner: 10Giuseppe Lavagetto) [07:10:28] (03PS2) 10Giuseppe Lavagetto: Allow suppressing SAN warnings from urllib3 [software/conftool] - 10https://gerrit.wikimedia.org/r/347591 [07:19:47] (03CR) 10Muehlenhoff: [C: 032] Mark wireshark-common/install-setuid as seen to avoid debconf prompt [puppet] - 10https://gerrit.wikimedia.org/r/346162 (owner: 10Muehlenhoff) [07:24:24] (03CR) 10Jcrespo: "I am not sure if we should do this anymore, and just add s1-master.codfw.wmnet, etc. instead. 
Thinking mostly about long term, and a poten" [dns] - 10https://gerrit.wikimedia.org/r/348440 (https://phabricator.wikimedia.org/T155099) (owner: 10Marostegui) [07:28:13] (03CR) 10Marostegui: "> I am not sure if we should do this anymore, and just add" [dns] - 10https://gerrit.wikimedia.org/r/348440 (https://phabricator.wikimedia.org/T155099) (owner: 10Marostegui) [07:33:49] 06Operations, 10ops-codfw, 06DC-Ops, 13Patch-For-Review, 15User-Elukey: Reclaim/Decommission mw2090->mw2096 (OOW) - https://phabricator.wikimedia.org/T161488#3189079 (10MoritzMuehlenhoff) mw2092 is still showing up in servermon: https://servermon.wikimedia.org/hosts/ [07:37:41] (03CR) 10Giuseppe Lavagetto: [C: 032] Allow suppressing SAN warnings from urllib3 [software/conftool] - 10https://gerrit.wikimedia.org/r/347591 (owner: 10Giuseppe Lavagetto) [07:42:57] <_joe_> !log cleaning up orphaned COW images in /var/cache/pbuilder/build/ on copper [07:43:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:48:07] <_joe_> !log uploaded python-conftool 0.4.1 to jessie-wikimedia [07:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:40] (03CR) 10Hoo man: [C: 031] "Fine to deploy at any time" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348413 (https://phabricator.wikimedia.org/T159851) (owner: 10Ladsgroup) [07:50:56] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/348478 (owner: 10RobH) [08:25:05] (03PS1) 10Muehlenhoff: Move gerrit to using Bouncycastle as packaged by Debian [puppet] - 10https://gerrit.wikimedia.org/r/348690 (https://phabricator.wikimedia.org/T163185) [08:37:40] (03CR) 10DCausse: [C: 031] Align elasticsearch jvm options with upstream [puppet] - 10https://gerrit.wikimedia.org/r/345632 (https://phabricator.wikimedia.org/T161830) (owner: 10EBernhardson) [08:39:57] PROBLEM - tools homepage -admin tool- on tools.wmflabs.org is CRITICAL: CRITICAL - Socket timeout after 20 seconds [08:40:37] RECOVERY - tools homepage -admin tool- on tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 3670 bytes in 0.013 second response time [09:00:28] (03CR) 10Volans: "yeah but right now is s1-master.eqiad.wmnet, also when pointing at codfw hosts, so conceptually wrong." [dns] - 10https://gerrit.wikimedia.org/r/348440 (https://phabricator.wikimedia.org/T155099) (owner: 10Marostegui) [09:00:40] 06Operations, 13Patch-For-Review, 15User-Elukey, 07Wikimedia-log-errors: Warning: timed out after 0.2 seconds when connecting to rdb1001.eqiad.wmnet [110]: Connection timed out - https://phabricator.wikimedia.org/T125735#3189263 (10elukey) After some (Easter) thinking I am convinced that the most probable... [09:03:14] !log upgrading conftool to v0.4.1 on neodymium/sarin [09:03:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:01] * elukey changed the channel's topic: Riccardo don't do dangerous updates before the 19th :P [09:08:23] * elukey runs [09:08:35] elukey: :D it's to avoid the annoying urllib3 SAN warning [09:11:27] <_joe_> elukey: it's not dangerous, and was actually my request [09:11:53] <_joe_> elukey: it's literally just adding a new field type to conftool [09:12:07] <_joe_> and disabling the SAN warning [09:12:16] ahahahah it was only an excuse to mock volans, nothing more :) [09:12:34] <_joe_> oh ok sorry, then. You can go on. [09:12:52] <_joe_> volans: WTF! upgrading a fundamental software the day of the switchovers!!1! 
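Earlier in the scroll, at [07:42:57], _joe_ logs cleaning up orphaned COW images on copper: copy-on-write chroots that interrupted pbuilder runs leave behind. A rough sketch of that kind of cleanup, assuming no build is in flight (the process-name guard is an assumption, not part of the logged action):

```bash
# Refuse to clean while a build might still be using a COW directory.
if pgrep -f pbuilder >/dev/null; then
    echo "pbuilder still running, not cleaning" >&2
    exit 1
fi

# Orphaned build chroots accumulate here after aborted builds.
rm -rf /var/cache/pbuilder/build/*
```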
[09:13:11] :-P [09:17:07] !log oblivian@neodymium conftool action : set/pooled=true; selector: dnsdisc=zotero,name=codfw [09:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:21] (03CR) 10Muehlenhoff: "It's probably best if we migrate statistics::packages to require_package()" [puppet] - 10https://gerrit.wikimedia.org/r/348669 (https://phabricator.wikimedia.org/T163177) (owner: 10EBernhardson) [09:23:19] 06Operations, 10netops: analytics hosts frequently tripping 'port utilization threshold' librenms alerts - https://phabricator.wikimedia.org/T133852#3189302 (10fgiunchedi) @ayounsi sounds good to me! I think for the longer period of time we can start with 3x (or 2x) the current 5min and see if that helps. Usua... [09:24:53] 06Operations, 10puppet-compiler: hosts with puppet compiler failures on every run - https://phabricator.wikimedia.org/T162949#3181168 (10fgiunchedi) I believe at least bast* and prometheus* are due to {T150456} [09:26:49] 06Operations, 10netops: analytics hosts frequently tripping 'port utilization threshold' librenms alerts - https://phabricator.wikimedia.org/T133852#3189317 (10faidon) Yup, sounds like a good idea. Thanks for taking care of that @ayounsi :) [09:26:51] 06Operations, 10Traffic, 10media-storage: swift-object-server 1.13.1: Wrong Content-Type returned on 304 Not Modified responses - https://phabricator.wikimedia.org/T162348#3189319 (10fgiunchedi) FWIW the swift 2.2.0 upgrade is complete (from T162609) [09:28:57] test [09:29:54] test [09:30:11] test [09:30:29] <_joe_> sorry, me doing tests ^^ [09:30:34] _joe_: test [09:30:38] :-P [09:32:36] (03CR) 10Alexandros Kosiaris: "'<<' syntax errors are fails to merge/rebase and as a result the resulting file on the PCC has conflicts that need to be resolved. Gerrit " [puppet] - 10https://gerrit.wikimedia.org/r/347023 (owner: 10Dzahn) [09:33:15] test [09:33:47] (03CR) 10Alexandros Kosiaris: [C: 031] standardize "include ::profile:*", "include ::nrpe" [puppet] - 10https://gerrit.wikimedia.org/r/347023 (owner: 10Dzahn) [09:36:11] 06Operations: Migrate all jessie hosts to Linux 4.9 - https://phabricator.wikimedia.org/T162029#3189339 (10ema) [09:36:13] 06Operations, 13Patch-For-Review: codfw/eqiad hosts occasionally spend > 3 minutes starting networking.service with linux 4.9 - https://phabricator.wikimedia.org/T162612#3189336 (10ema) 05Open>03Resolved a:03ema Blacklisting intel_uncore fixed the problem. [09:41:38] 06Operations, 10ops-codfw, 06Performance-Team, 15User-fgiunchedi: Plug in ex-graphite2001 SSDs to recover coal data - https://phabricator.wikimedia.org/T161538#3134685 (10fgiunchedi) @Krinkle on graphite2001, I've opened {T163194} to followup on the actual backfill. Note I won't have to work on it this wee... 
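The "conftool action" lines above, like pooling zotero in codfw at [09:17:07], are what confctl prints to the SAL when it changes an object. The command behind such a line looks roughly like this; treat the exact selector syntax as an assumption for the conftool version deployed here:

```bash
# Pool the codfw side of the zotero discovery record; confctl logs this
# as "conftool action : set/pooled=true; selector: dnsdisc=zotero,name=codfw".
confctl select 'dnsdisc=zotero,name=codfw' set/pooled=true

# Inspect the object afterwards.
confctl select 'dnsdisc=zotero,name=codfw' get
```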
[09:42:37] 06Operations, 06Performance-Team, 15User-fgiunchedi: Backfill restored coal whisper files with current data - https://phabricator.wikimedia.org/T163194#3189342 (10fgiunchedi) [09:44:45] (03CR) 10Alexandros Kosiaris: [C: 032] RESTBase: Add the CXServer service URI [puppet] - 10https://gerrit.wikimedia.org/r/348154 (https://phabricator.wikimedia.org/T107914) (owner: 10Mobrovac) [09:44:49] (03PS2) 10Alexandros Kosiaris: RESTBase: Add the CXServer service URI [puppet] - 10https://gerrit.wikimedia.org/r/348154 (https://phabricator.wikimedia.org/T107914) (owner: 10Mobrovac) [09:44:53] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] RESTBase: Add the CXServer service URI [puppet] - 10https://gerrit.wikimedia.org/r/348154 (https://phabricator.wikimedia.org/T107914) (owner: 10Mobrovac) [09:45:09] <_joe_> !log adding 60G to the ocg output partition on ocg1003 [09:45:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:20] <_joe_> !log testing switchover script for services, will act on zotero in codfw [09:50:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:33] <_joe_> uhm [09:52:15] !log oblivian: Setting zotero in codfw UP [09:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:01:13] (03CR) 10Alexandros Kosiaris: [C: 04-1] "It's an interesting approach, I think it should work. Minor comment inline, otherwise LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/347984 (owner: 10Dzahn) [10:01:15] (03CR) 10Filippo Giunchedi: [C: 031] adds naos to everywhere mira is listed [puppet] - 10https://gerrit.wikimedia.org/r/348478 (owner: 10RobH) [10:03:21] (03PS1) 10Ema: Revert "cache_upload: lower keep from 3d to 1d on upload backends" [puppet] - 10https://gerrit.wikimedia.org/r/348698 (https://phabricator.wikimedia.org/T162035) [10:09:51] (03PS1) 10Ema: Revert "cache_upload: override CT updates on 304s" [puppet] - 10https://gerrit.wikimedia.org/r/348699 (https://phabricator.wikimedia.org/T162035) [10:10:08] (03CR) 10Alexandros Kosiaris: [C: 031] monitoring: add timeout parameter to bad_directory_owner check [puppet] - 10https://gerrit.wikimedia.org/r/348667 (owner: 10Dzahn) [10:13:22] (03CR) 10Alexandros Kosiaris: [C: 031] monitoring: fix wrong parameter bug in file ownership check [puppet] - 10https://gerrit.wikimedia.org/r/348664 (owner: 10Dzahn) [10:20:14] !log uploaded HHVM 3.18.2+wmf2 for jessie-wikimedia/experimental (includes fix for T162354) [10:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:20:21] T162354: Frequent TCP RST on connections between HHVM and Redis - https://phabricator.wikimedia.org/T162354 [10:23:08] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, also one OCD-nit" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/348699 (https://phabricator.wikimedia.org/T162035) (owner: 10Ema) [10:25:06] !log installing wireshark security updates [10:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:26] !log Final test of switchdc steps in the codfw->eqiad configuration, only idempotent changes, T160178 [10:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:25:33] T160178: MediaWiki Datacenter Switchover automation - https://phabricator.wikimedia.org/T160178 [10:26:29] !log switchdc (volans@sarin) START TASK - switchdc.stages.t00_disable_puppet(codfw, eqiad) Disabling puppet on selected hosts [10:26:34] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:26:56] !log switchdc (volans@sarin) END TASK - switchdc.stages.t00_disable_puppet(codfw, eqiad) Successfully completed [10:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:27:05] 06Operations: Puppet facts around the primary network interface and IPv4/IPv6 address - https://phabricator.wikimedia.org/T163196#3189426 (10faidon) [10:28:38] !log switchdc (volans@sarin) START TASK - switchdc.stages.t00_reduce_ttl(codfw, eqiad) Reduce the TTL of all the MediaWiki discovery records [10:28:39] !log switchdc (volans@sarin) END TASK - switchdc.stages.t00_reduce_ttl(codfw, eqiad) Failed to execute [10:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:27] (03CR) 10Alexandros Kosiaris: [C: 04-1] "syntactically this looks correct, but I am not sure it actually works." [puppet] - 10https://gerrit.wikimedia.org/r/348184 (https://phabricator.wikimedia.org/T161563) (owner: 10Ladsgroup) [10:31:19] !log switchdc (volans@sarin) START TASK - switchdc.stages.t00_disable_puppet(codfw, eqiad) Disabling puppet on selected hosts [10:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:26] !log switchdc (volans@sarin) END TASK - switchdc.stages.t00_disable_puppet(codfw, eqiad) Successfully completed [10:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:33] !log switchdc (volans@sarin) START TASK - switchdc.stages.t00_reduce_ttl(codfw, eqiad) Reduce the TTL of all the MediaWiki discovery records [10:31:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:46] !log switchdc (volans@sarin) END TASK - switchdc.stages.t00_reduce_ttl(codfw, eqiad) Successfully completed [10:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:42] !log switchdc (volans@sarin) START TASK - switchdc.stages.t01_stop_maintenance(codfw, eqiad) Stop MediaWiki maintenance in the old master DC [10:33:45] !log switchdc (volans@sarin) END TASK - switchdc.stages.t01_stop_maintenance(codfw, eqiad) Failed to execute [10:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:57] PROBLEM - puppet last run on wtp2011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[tshark] [10:41:57] RECOVERY - puppet last run on wtp2011 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [10:43:53] !log switchdc (volans@sarin) START TASK - switchdc.stages.t02_start_mediawiki_readonly(codfw, eqiad) Set MediaWiki in read-only mode (db_from config already merged and git pulled) [10:43:54] !log switchdc (volans@sarin) END TASK - switchdc.stages.t02_start_mediawiki_readonly(codfw, eqiad) Successfully completed [10:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:08] 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#3189446 (10Marostegui) [10:46:55] 06Operations, 10DBA, 13Patch-For-Review: Decommission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2266762 (10Marostegui) [10:48:39] !log switchdc (volans@sarin) START TASK - switchdc.stages.t03_coredb_masters_readonly(codfw, eqiad) set core DB masters in read-only mode [10:48:43] !log switchdc (volans@sarin) END TASK - switchdc.stages.t03_coredb_masters_readonly(codfw, eqiad) Failed to execute [10:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:57] expected, eqiad is RW ofc [10:55:53] !log switchdc (volans@sarin) START TASK - switchdc.stages.t05_switch_datacenter(codfw, eqiad) Switch MediaWiki configuration to the new datacenter [10:55:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:00] !log switchdc (volans@sarin) END TASK - switchdc.stages.t05_switch_datacenter(codfw, eqiad) Successfully completed [10:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:38] !log switchdc (volans@sarin) START TASK - switchdc.stages.t05_switch_traffic(codfw, eqiad) Switch traffic flow to the appservers in the new datacenter [10:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:31] !log switchdc (volans@sarin) END TASK - switchdc.stages.t05_switch_traffic(codfw, eqiad) Successfully completed [10:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:37] PROBLEM - puppet last run on kafka1012 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:06:43] (03CR) 10Filippo Giunchedi: [C: 031] varnish: swap around backend ttl cap and keep values [2/2] [puppet] - 10https://gerrit.wikimedia.org/r/343845 (https://phabricator.wikimedia.org/T124954) (owner: 10Ema) [11:07:04] (03CR) 10Filippo Giunchedi: [C: 031] Revert "cache_upload: lower keep from 3d to 1d on upload backends" [puppet] - 10https://gerrit.wikimedia.org/r/348698 (https://phabricator.wikimedia.org/T162035) (owner: 10Ema) [11:09:48] (03PS2) 10Ema: Revert "cache_upload: override CT updates on 304s" [puppet] - 10https://gerrit.wikimedia.org/r/348699 (https://phabricator.wikimedia.org/T162035) [11:10:21] (03PS1) 10Giuseppe Lavagetto: Separate the logic to stop videoscalers as they run upstart [switchdc] - 10https://gerrit.wikimedia.org/r/348703 [11:14:06] !log upgrading logstash* to Linux 4.9 [11:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:17] 06Operations, 13Patch-For-Review, 15User-Elukey: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850#3189499 (10akosiaris) >>! In T159850#3182388, @elukey wrote: > From http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/Replication.Redis.Versions.html: >... [11:18:26] !log switchdc (volans@sarin) START TASK - switchdc.stages.t06_redis(codfw, eqiad) Switch the Redis replication [11:18:30] !log switchdc (volans@sarin) END TASK - switchdc.stages.t06_redis(codfw, eqiad) Successfully completed [11:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:55] godog, I hear you may have setup the swift hosts for the beta cluster on labs? [11:19:15] addshore: that is correct yeah! [11:20:09] godog: I'm currently working on Extension:FileImporter & https://phabricator.wikimedia.org/T161012 and need to have swift for my dev setup. [11:20:40] I tried setting up swift in a docker container, and the swift side of things seemed to work, but couldnt make mediawiki fully interact correctly. [11:21:15] Is there something super obvious that I am probably missing? I guess i have to do some additional setup on swift to make mediawiki work [11:22:29] I was also thinking if it would be plausible to setup swift on labs & then talk to it from my dev machine, heh [11:23:16] 06Operations, 10vm-requests, 13Patch-For-Review: Site: 2 VM request for tendril (switch tendril from einsteinium to dbmonitor*) - https://phabricator.wikimedia.org/T149557#3189500 (10akosiaris) I think the only thing left to do is assess that dbmonitors are OK and then proceed with switching the DNS CNAME, s... [11:24:13] addshore: if mw can talk to swift with the right credentials then filebackend needs its containers created with setZoneAccess.php IIRC [11:24:33] "then filebackend needs its containers created with setZoneAccess.php IIRC" ack, that sounds like something I am missing [11:24:53] and yeh, mw seems to auth with swift just fine :) Thanks, I'll have a look at that script and poke around a bit more :D [11:25:21] addshore: np! 
yeah if mw can talk to swift fine then that should be the next step [11:28:37] RECOVERY - puppet last run on kafka1012 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [11:30:12] !log switchdc (volans@sarin) START TASK - switchdc.stages.t07_coredb_masters_readwrite(codfw, eqiad) set core DB masters in read-write mode [11:30:16] !log switchdc (volans@sarin) END TASK - switchdc.stages.t07_coredb_masters_readwrite(codfw, eqiad) Successfully completed [11:30:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:03] !log switchdc (volans@sarin) START TASK - switchdc.stages.t08_stop_mediawiki_readonly(codfw, eqiad) Set MediaWiki in read-write mode (db_to config already merged and git pulled) [11:31:04] !log switchdc (volans@sarin) END TASK - switchdc.stages.t08_stop_mediawiki_readonly(codfw, eqiad) Successfully completed [11:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:18] !log switchdc (volans@sarin) START TASK - switchdc.stages.t09_restore_ttl(codfw, eqiad) Restore the TTL of all the MediaWiki discovery records [11:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:30] !log switchdc (volans@sarin) END TASK - switchdc.stages.t09_restore_ttl(codfw, eqiad) Successfully completed [11:33:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:08] !log switchdc (volans@sarin) START TASK - switchdc.stages.t09_tendril(codfw, eqiad) Update Tendril configuration for the new masters [11:34:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:26] !log switchdc (volans@sarin) END TASK - switchdc.stages.t09_tendril(codfw, eqiad) Successfully completed [11:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:13] !log switchdc (volans@sarin) START TASK - switchdc.stages.t09_tendril(eqiad, codfw) Update Tendril configuration for the new masters [11:35:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:21] testing tendril on both ways [11:35:28] !log switchdc (volans@sarin) END TASK - switchdc.stages.t09_tendril(eqiad, codfw) Successfully completed [11:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:35:37] PROBLEM - salt-minion processes on thumbor1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
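The thumbor salt-minion flaps above are NRPE socket timeouts, not real process failures: the recovery text ("PROCS OK: 1 process with regex args ...") is check_procs output. The two halves of the check are roughly the following sketch; the NRPE command name and the 1:1 bounds are assumptions:

```bash
# On the monitored host, via NRPE: expect exactly one salt-minion process.
/usr/lib/nagios/plugins/check_procs -c 1:1 \
    --ereg-argument-array='^/usr/bin/python /usr/bin/salt-minion'

# From the icinga server, with the 10s timeout seen in the CRITICALs above.
/usr/lib/nagios/plugins/check_nrpe -H thumbor1001.eqiad.wmnet \
    -c check_salt_minion -t 10
```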
[11:36:27] RECOVERY - salt-minion processes on thumbor1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [11:37:22] !log switchdc (volans@sarin) START TASK - switchdc.stages.t09_tendril(codfw, eqiad) Update Tendril configuration for the new masters [11:37:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:38] !log switchdc (volans@sarin) END TASK - switchdc.stages.t09_tendril(codfw, eqiad) Successfully completed [11:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:50] !log switchdc (volans@sarin) START TASK - switchdc.stages.t09_start_maintenance(codfw, eqiad) Start MediaWiki maintenance in the new master DC [11:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:56] !log switchdc (volans@sarin) END TASK - switchdc.stages.t09_start_maintenance(codfw, eqiad) Successfully completed [11:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:17] PROBLEM - ElasticSearch health check for shards on logstash1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 72 threshold =0.1% breach: status: yellow, number_of_nodes: 5, unassigned_shards: 72, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 72, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as [11:41:17] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch inactive shards 72 threshold =0.1% breach: status: yellow, number_of_nodes: 5, unassigned_shards: 72, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 72, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as [11:41:37] PROBLEM - ElasticSearch health check for shards on logstash1002 is CRITICAL: CRITICAL - elasticsearch inactive shards 72 threshold =0.1% breach: status: yellow, number_of_nodes: 5, unassigned_shards: 72, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 72, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as [11:43:17] moritzm: ^^^ [11:44:17] RECOVERY - ElasticSearch health check for shards on logstash1003 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 72, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 99.537037037 [11:44:18] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 72, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 99.537037037 [11:44:19] all shards are started now [11:44:32] ok, great [11:44:37] RECOVERY - ElasticSearch health check for shards on logstash1002 is OK: OK - elasticsearch status production-logstash-eqiad: status: yellow, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, 
number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 72, task_max_waiting_in_queue_millis: 0, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards_percent_as_number: 99.537037037 [11:46:37] dcausse: o/ [11:46:46] elukey: hi! [11:46:58] that was a normal scheduled reboot of a logstash ES node, if that triggers an icinga alert, we should tweak the threshold for logstash/es, they're probably more geared towards elastic* [11:47:22] moritzm: did you restart all nodes at the same time? [11:47:47] dcausse: no, just 1004 [11:47:56] hm... that is not normal :/ [11:48:12] 1001-1003 were rebooted sequentially earlier, but they hold no data IIRC [11:48:19] yes [11:48:34] and they had already turned green when I logged into the hosts to check the status [11:49:09] ok, so maybe it's just like you say: tuning icinga threshold [11:49:21] dcausse: I remember we had that phenomenon in the past already (and also that it was fixed/changed), I can dig in my mails/IRC logs [11:49:42] moritzm: for logstash or the search cluster? [11:49:45] the last round of logstash* reboots certainly didn't trigger this [11:49:48] for logstash only [11:49:50] ok [11:50:09] ^ gehel: does that ring a bell [11:50:19] (03CR) 10Volans: [C: 04-1] "I agree with the approach, but have some comments inline." (034 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/348703 (owner: 10Giuseppe Lavagetto) [11:50:30] gehel is out for a while, should be back soon [11:51:15] ok, there's no hurry. I'll withhold the reboots of 1005/1006 so that these are available for diagnostics in case we want to confirm some hypothesis [11:53:22] moritzm: a 0.1% threshold with 3 data nodes does not make sense to me, but I may be overlooking something [12:04:04] yeah, 0.1% is the default value of the plugin, maybe this simply needs a saner config setting [12:09:59] and it looks more like 10% (maybe the way we display this percentage in icinga is not right) [12:10:12] at least 34% for logstash would make more sense [12:12:59] (03CR) 10Hashar: "I gave it a look last week. 
On Debian Jessie that fails to include jessie-backports but I can't find out what is happening there :-\" [puppet] - 10https://gerrit.wikimedia.org/r/345866 (owner: 10Gehel) [12:13:18] !log upgraded mw1261 to HHVM 3.18.2+wmf2 [12:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:31] godog: the script seems to have run just fine, but I still seem to get "StoreFileOp failed (batch #75e5u5xi8byxzi74svb8w9fuxjxp0qg): {"src":"C:\\Windows\\Temp\\phpC526.tmp","dst":"mwstore://local-swift-local/local-public/8/88/Guitar.jpg","overwrite":true,"headers":[],"dstExists":false,"failedAction":"attempt"}" on upload attempt [12:19:00] (03CR) 10BBlack: [C: 031] Revert "cache_upload: lower keep from 3d to 1d on upload backends" [puppet] - 10https://gerrit.wikimedia.org/r/348698 (https://phabricator.wikimedia.org/T162035) (owner: 10Ema) [12:19:17] (03CR) 10BBlack: [C: 04-1] "Let's hold for post-switchover" [puppet] - 10https://gerrit.wikimedia.org/r/343845 (https://phabricator.wikimedia.org/T124954) (owner: 10Ema) [12:20:14] (03CR) 10BBlack: [C: 031] Revert "cache_upload: override CT updates on 304s" [puppet] - 10https://gerrit.wikimedia.org/r/348699 (https://phabricator.wikimedia.org/T162035) (owner: 10Ema) [12:20:43] 06Operations, 10netops: analytics hosts frequently tripping 'port utilization threshold' librenms alerts - https://phabricator.wikimedia.org/T133852#3189553 (10ayounsi) a:03ayounsi [12:32:47] !log upgrading labnodepool1001 to Linux 4.9 [12:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:45] (03CR) 10Giuseppe Lavagetto: Separate the logic to stop videoscalers as they run upstart (034 comments) [switchdc] - 10https://gerrit.wikimedia.org/r/348703 (owner: 10Giuseppe Lavagetto) [12:51:25] (03PS2) 10Alexandros Kosiaris: service::node: change logrotate parameters [puppet] - 10https://gerrit.wikimedia.org/r/232722 [12:52:12] (03CR) 10Alexandros Kosiaris: "Removing my -1, dependent task has been resolved, let's finally merge this after all this time" [puppet] - 10https://gerrit.wikimedia.org/r/232722 (owner: 10Alexandros Kosiaris) [12:53:20] (03PS2) 10Giuseppe Lavagetto: Separate the logic to stop videoscalers as they run upstart [switchdc] - 10https://gerrit.wikimedia.org/r/348703 [12:54:57] (03PS6) 10Ema: Release pybal 1.13.6 [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/348066 (https://phabricator.wikimedia.org/T103882) [12:55:40] (03CR) 10Ema: Release pybal 1.13.6 (031 comment) [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/348066 (https://phabricator.wikimedia.org/T103882) (owner: 10Ema) [12:57:44] (03PS1) 10Alexandros Kosiaris: zotero: Fix logrotate [puppet] - 10https://gerrit.wikimedia.org/r/348713 [12:57:47] (03CR) 10Alexandros Kosiaris: [C: 032] service::node: change logrotate parameters [puppet] - 10https://gerrit.wikimedia.org/r/232722 (owner: 10Alexandros Kosiaris) [12:58:03] (03PS2) 10Alexandros Kosiaris: zotero: Fix logrotate [puppet] - 10https://gerrit.wikimedia.org/r/348713 [12:58:09] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] zotero: Fix logrotate [puppet] - 10https://gerrit.wikimedia.org/r/348713 (owner: 10Alexandros Kosiaris) [13:02:54] 06Operations, 10ops-esams, 10netops: esams higher than usual temperature - https://phabricator.wikimedia.org/T162152#3189612 (10ayounsi) a:03ayounsi Looking at the graphs, now that the FPC has been replaced, the temperature is much more regular. Also the temperature variations seemed to match Amsterdam's w... 
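Back to addshore's StoreFileOp failure above: since authentication against swift works, the missing piece is most likely the container setup godog pointed at earlier, setZoneAccess.php. A sketch of that step, with the backend name read off the failing mwstore:// URL; verify the flag against the script's --help, and the swift credentials shown are placeholders:

```bash
# Create the zone containers and set their ACLs for the configured backend
# (assumed flag; backend name taken from the mwstore://local-swift-local dst).
php maintenance/setZoneAccess.php --backend=local-swift-local

# Cross-check from the swift side that local-public etc. now exist.
swift -A http://swift.example:8080/auth/v1.0 -U mw:media -K CHANGEME list
```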
[13:04:07] PROBLEM - Host ripe-atlas-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [13:05:37] (03CR) 10BBlack: [C: 031] Release pybal 1.13.6 [debs/pybal] (1.13) - 10https://gerrit.wikimedia.org/r/348066 (https://phabricator.wikimedia.org/T103882) (owner: 10Ema) [13:10:32] (03PS1) 10Alexandros Kosiaris: service::node: change logrotate parameters [puppet] - 10https://gerrit.wikimedia.org/r/348717 [13:10:45] moritzm: the icinga alert on logstash is probably because a full reboot takes more time than the elasticsearch restarts that I normally do. [13:12:17] PROBLEM - puppet last run on mc1031 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:12:30] moritzm: we alert on a percentage of unassigned shards. When we lose a node, 1/3 of the shards get unassigned. If just elasticsearch is restarted, those shards recover faster than icinga takes to alert. With a full reboot, we are probably just outside the window... [13:15:03] but we should use a higher threshold for logstash than elastic*, the logstash/ES cluster in yellow with one node down and recovering isn't critical per se [13:15:45] maybe... [13:16:45] Yeah, on logstash, we have maxed out the number of replicas, so if a node goes down, there is no way to reassign those shards. So a threshold of 34% might make sense [13:16:57] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 262 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [13:17:17] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 395 probes of 433 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [13:17:27] It just means we might not catch an issue where some shards are left unassigned for a reason other than a node being down... [13:17:45] we certainly ran into that problem with logstash* reboots before, but I was under the impression we had fixed that already. but maybe I'm mixing this up with something else [13:17:52] * gehel can't think of what those reasons might be [13:18:37] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 150 probes of 284 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [13:18:37] No, last time it was a different issue, we had shards that were left unassigned after the node came back up. I can't remember what the reason was, but we had to do a manual step to recover them [13:18:57] PROBLEM - IPv4 ping to ulsfo on ripe-atlas-ulsfo is CRITICAL: CRITICAL - failed 115 probes of 451 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [13:19:57] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 745220 [13:20:55] just thinking out loud here, but couldn't you make icinga ignore the nodes when reboots occur? [13:21:20] gehel: right, that was what I had in mind [13:21:44] Zppix: this is a cluster wide check, the node being rebooted is silenced in icinga. [13:21:57] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 15 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [13:22:17] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 2 probes of 433 (alerts on 19) - https://atlas.ripe.net/measurements/1791210/#!map [13:22:30] Ah, I see. [13:22:52] Zppix: when a node is rebooted, all the shards hosted by that node become unavailable. As we have 3 data nodes, 1/3 of the shards suddenly disappear (unassigned). 
Our check is on the percentage of unassigned shards. [13:23:37] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 15 probes of 284 (alerts on 19) - https://atlas.ripe.net/measurements/1791309/#!map [13:23:57] RECOVERY - IPv4 ping to ulsfo on ripe-atlas-ulsfo is OK: OK - failed 5 probes of 451 (alerts on 19) - https://atlas.ripe.net/measurements/1791307/#!map [13:23:57] gehel: an if statement stating if a node reboot is occurring, then check for 2/3 of shards unassigned? [13:24:41] Zppix: yes, we could have that kind of more complex logic. [13:25:26] I mean i've personally not messed with icinga 1 but i have with icinga 2 a bit, and it couldn't be that hard to implement, no? [13:25:29] Or we could add LVS in front of that cluster, have a single cluster wide check (we have the same check multiple times at the moment) and change the procedure so that this check is also disabled during node restart [13:26:52] gehel: the problem with doing that though is that if you disable the check entirely and the other nodes then stop working for some other reason, more than likely no-one would be alerted to it [13:28:33] yes, but it is for a short time (minutes) during which the person doing the maintenance should have a close look at what is going on anyway. [13:29:22] I'm not a big fan of more complex checks because it is too easy to get the logic slightly wrong in case of unexpected failure modes. [13:29:59] (03CR) 10Volans: [C: 031] "LGTM, let's test it today :)" [switchdc] - 10https://gerrit.wikimedia.org/r/348703 (owner: 10Giuseppe Lavagetto) [13:30:04] gehel: I mean run a trial with the LVS idea and if that works okay then stick with it, otherwise complex may be the way to go. [13:30:36] Zppix: yep, might be... we'll see! [13:30:52] (03CR) 10Giuseppe Lavagetto: [C: 032] Separate the logic to stop videoscalers as they run upstart [switchdc] - 10https://gerrit.wikimedia.org/r/348703 (owner: 10Giuseppe Lavagetto) [13:31:22] addshore: gah, any other message? [13:31:26] gehel: the only issue besides human error I see happening with the complex approach is when/if the upgrade to icinga 2 happens [13:33:12] Zppix: you mean you'd link checks together in icinga? Have the number of unassigned shards that we check be dependent on the number of nodes in error? [13:34:44] gehel: well not exactly, what i was saying was: when a node is rebooted by an operator then increase the threshold to 2/3, else set it to 1/3 [13:35:16] more of a modification of the current check i guess [13:35:27] !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t01_stop_maintenance(codfw, eqiad) Stop MediaWiki maintenance in the old master DC [13:35:31] !log switchdc (oblivian@sarin) END TASK - switchdc.stages.t01_stop_maintenance(codfw, eqiad) Failed to execute [13:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:12] so instead of disabling the check during maintenance, change its threshold? I don't think this is gaining much over disabling the check during maintenance... [13:36:59] 06Operations, 10Monitoring, 07LDAP, 13Patch-For-Review: allow paging to work properly in ldap - https://phabricator.wikimedia.org/T162745#3189712 (10MoritzMuehlenhoff) @bd808 : The current limits setting applies to authenticated users, while your script makes an anonymous LDAP bind, see http://www.openldap... 
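The alternative gehel keeps coming back to in the exchange above, silencing during maintenance rather than smarter thresholds, is one write to icinga's external command file with stock icinga. A sketch of scheduling downtime for the cluster-wide shard check during a reboot; the command-file path and the service description are assumptions for this installation:

```bash
now=$(date +%s)

# 30 minutes of fixed downtime for the shard check while a node reboots,
# instead of raising the threshold via more complex check logic.
printf '[%d] SCHEDULE_SVC_DOWNTIME;logstash1001;ElasticSearch health check for shards;%d;%d;1;0;1800;%s;logstash reboot\n' \
    "$now" "$now" "$((now + 1800))" "$USER" >> /var/lib/icinga/rw/icinga.cmd
```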
[13:37:23] gehel: change its threshold automatically via a script when maintenance is occurring [13:37:48] would be nice if elastic provided an api endpoint to tell "last time the cluster was green" [13:38:11] yes, but starting maintenance is a manual operation, so even if scripted, that script is launched by a human [13:38:57] gehel: i'd assume you issue a command when starting maintenance, so you could make it so that said command runs a script to change the threshold [13:39:05] the shard check would be set to 34%, and the "last time in green" check would help to catch some shard inconsistencies (cluster config issues) [13:39:07] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 631 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2865591 keys, up 25 days 21 hours - replication_delay is 631 [13:39:58] dcausse: and green in the generic sense is not a very good indicator (which is why we use a %-age of unassigned shards). On the cirrus cluster, during reindex, you can go a fairly long time without the cluster being green... [13:40:17] RECOVERY - puppet last run on mc1031 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures [13:41:07] regardless, let's try out your idea with LVS and see if that works out without an unbearable amount of issues gehel [13:41:47] * gehel is going to admit his lack of knowledge about icinga... [13:42:06] gehel: yes very true, but if the cluster has not been green for more than 3 days I'd suspect a config issue there [13:42:32] welcome to the club gehel lol [13:42:54] dcausse: i see issues arising with that idea though. [13:43:02] Zppix: so it seems it is possible to change a check configuration by API. How does it interact with on-disk configuration changes? [13:43:29] since our config is puppet managed and is subject to be reloaded during a maintenance window... [13:46:47] PROBLEM - Check Varnish expiry mailbox lag on cp1073 is CRITICAL: CRITICAL: expiry mailbox lag is 598122 [13:47:54] 06Operations: reinstall rcs100[12] with RAID - https://phabricator.wikimedia.org/T140441#3189742 (10Ottomata) The plan is to decommission these in July. [13:48:19] gehel: make it a requirement to issue the script to change it before disk config changes [13:48:35] puppet is capable of that no? [13:49:07] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2841451 keys, up 25 days 21 hours - replication_delay is 41 [13:49:25] (03CR) 10Ottomata: "Hm." [puppet] - 10https://gerrit.wikimedia.org/r/348669 (https://phabricator.wikimedia.org/T163177) (owner: 10EBernhardson) [13:49:42] Zppix: I mean that during a logstash maintenance, we might have a totally unrelated change that will change icinga config and reload it. So if there are changes to config done in multiple ways, we are going to get lost at some point. [13:50:43] !log switchdc (oblivian@sarin) START TASK - switchdc.stages.t09_restart_parsoid(codfw, eqiad) Rolling restart parsoid in eqiad and codfw [13:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:53] gehel: Has that ever happened though? 
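Both threads above, the unassigned-shard percentage the check thresholds on and dcausse's "last time the cluster was green", can be derived from the stock _cluster/health API. A sketch assuming jq is available on the host:

```bash
health=$(curl -s http://localhost:9200/_cluster/health)

# The percentage the icinga threshold debate is about (0.1% vs 34%).
echo "$health" | jq '100 * .unassigned_shards
    / (.active_shards + .relocating_shards + .initializing_shards + .unassigned_shards)'

# Crude "last time the cluster was green": run from cron, keep a timestamp.
[ "$(echo "$health" | jq -r .status)" = "green" ] && date +%s > /var/tmp/es_last_green
```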
[13:51:40] Zppix: we don't have that issue atm, so I dont know for sure, but I suspect it happens fairly often [13:53:01] gehel: if icinga is reloaded for some reason i suspect that but the time it runs the checks again logstash will be back to normal operation [13:54:02] by* [13:57:59] godog: not that I can clearly see, how well do you know the code? Are there any bits I could / should add extra logging to to find out more? If not I'll just keep poking it a bit after my next meeting and see if i can gain any more [13:59:07] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 641 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2841451 keys, up 25 days 21 hours - replication_delay is 641 [13:59:15] 06Operations, 10ops-codfw, 13Patch-For-Review, 15User-fgiunchedi: codfw: ms-be2028-ms-be2039 rack/setup - https://phabricator.wikimedia.org/T158337#3189770 (10fgiunchedi) 05Open>03Resolved This is completed, decom for equivalent old hw is {T162785} [14:00:51] addshore: heh not very well, tcpdump'ing the requests to switch might shed some light tho [14:02:24] 06Operations, 13Patch-For-Review, 05Prometheus-metrics-monitoring, 15User-fgiunchedi: Put prometheus baremetal servers in service - https://phabricator.wikimedia.org/T148408#3189788 (10fgiunchedi) 05Open>03Resolved This is completed, baremetal in service [14:04:35] !log executed CONFIG SET appendfsync no on rdb2005:6479 to test if fsync stalls affect replication - T159850 [14:04:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:42] T159850: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850 [14:05:59] reverting the behavior is super easy and it will be done in maximum 1|2 hours, rdb2005 will become primary tomorrow and it will need to fsync [14:08:43] !log switchdc (oblivian@sarin) END TASK - switchdc.stages.t09_restart_parsoid(codfw, eqiad) Successfully completed [14:08:43] (03CR) 10Alexandros Kosiaris: [C: 04-1] "reload does not really work for the services right now as they don't specify an ExecReload action in their systemd units. 
Specifying one i" [puppet] - 10https://gerrit.wikimedia.org/r/348717 (owner: 10Alexandros Kosiaris) [14:08:45] atm it seems hitting output-buffers soft limits (buffer kept with ~500Mb of data for more than 60s) [14:08:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:11] (03PS2) 10BBlack: traffic: a/p services switch to temporary a/a [puppet] - 10https://gerrit.wikimedia.org/r/347852 [14:10:21] (03PS2) 10BBlack: traffic: a/a services switch to codfw-only [puppet] - 10https://gerrit.wikimedia.org/r/347853 [14:11:48] !log executed CONFIG SET appendfsync everysec (default) to restore defaults on rdb2005:6479- T159850 [14:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:55] T159850: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850 [14:14:04] !log oblivian: Setting mobileapps in eqiad DOWN [14:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:21] (03PS1) 10Urbanecm: Raise requirements for getting autoconfirmed status to 4 days, 10 edits at cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348725 (https://phabricator.wikimedia.org/T163207) [14:18:31] gehel,dcausse: I'll resume the logstash reboots with 1005/1006, then (I've silenced the "ElasticSearch health check for shards" check for logstash) [14:18:37] (03PS1) 10BBlack: ores.wm.o: active/active backend [puppet] - 10https://gerrit.wikimedia.org/r/348726 [14:18:43] moritzm: good! [14:18:59] moritzm: I'll keep an eye on cluster health as well... [14:19:13] gehel: Ill looking the LVS thing and see if I can't figure out that for you [14:19:17] into* [14:19:29] Zppix: thanks! [14:20:52] (03PS1) 10Urbanecm: Make sysops able to grant/remove confirmed user group at cswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348727 (https://phabricator.wikimedia.org/T163206) [14:21:09] !log oblivian: Setting mobileapps in eqiad UP [14:21:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:13] !log executed config set client-output-buffer-limit "normal 0 0 0 slave 2147483648 2147483648 300 pubsub 33554432 8388608 60" on rdb2005:6749 as attempt to solve slave lagging - T159850 [14:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:19] T159850: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850 [14:23:04] <_joe_> can I ask everyone to stop merging / doing changes? the services switchover is coming up [14:23:19] (03CR) 10Alexandros Kosiaris: [C: 031] ores.wm.o: active/active backend [puppet] - 10https://gerrit.wikimedia.org/r/348726 (owner: 10BBlack) [14:23:30] _joe_: last one here in preparation for tomorrow... [14:23:46] _joe_ ack, I am now only watching logs [14:23:47] pulling back elastic2020 into the cluster after a chat with Papaul... [14:23:49] <_joe_> gehel: it should wait ~ 20 minutes [14:24:10] (03CR) 10BBlack: [C: 032] ores.wm.o: active/active backend [puppet] - 10https://gerrit.wikimedia.org/r/348726 (owner: 10BBlack) [14:24:41] _joe_: not a puppet change, just a cluster setting on elasticsearch, should it also wait 20' ? 
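The two rdb2005 experiments logged above (CONFIG SET appendfsync at [14:04:35]/[14:11:48], and the client-output-buffer-limit bump at [14:22:13]) are plain redis-cli calls; a sketch using the exact values from the !log lines. Neither setting survives a restart unless also written to the config file:

```bash
R="redis-cli -h rdb2005.codfw.wmnet -p 6479"

# Test whether AOF fsync stalls cause the replication lag, then revert
# to the default.
$R config set appendfsync no
$R config set appendfsync everysec

# Hard limit (bytes), soft limit (bytes) and soft seconds per client class,
# so the master does not drop a slow replica while buffering ~500MB for it.
$R config set client-output-buffer-limit \
    "normal 0 0 0 slave 2147483648 2147483648 300 pubsub 33554432 8388608 60"

# Watch replication_delay recover.
$R info replication
```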
[14:25:13] <_joe_> gehel: go on then :) [14:25:15] (03PS1) 10BBlack: stream.wm.o eventstreams backend active/active [puppet] - 10https://gerrit.wikimedia.org/r/348734 [14:25:42] (03CR) 10Giuseppe Lavagetto: [C: 031] stream.wm.o eventstreams backend active/active [puppet] - 10https://gerrit.wikimedia.org/r/348734 (owner: 10BBlack) [14:25:44] !log un-ban elastic2020 to get ready for real-life test during switchover - T149006 [14:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:53] T149006: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006 [14:26:02] * gehel is going on then... [14:26:05] (03CR) 10Ottomata: [C: 031] stream.wm.o eventstreams backend active/active [puppet] - 10https://gerrit.wikimedia.org/r/348734 (owner: 10BBlack) [14:27:12] (03PS2) 10BBlack: stream.wm.o eventstreams backend active/active [puppet] - 10https://gerrit.wikimedia.org/r/348734 [14:27:21] (03CR) 10BBlack: [V: 032 C: 032] stream.wm.o eventstreams backend active/active [puppet] - 10https://gerrit.wikimedia.org/r/348734 (owner: 10BBlack) [14:30:42] (03PS3) 10BBlack: traffic: a/p services switch to temporary a/a [puppet] - 10https://gerrit.wikimedia.org/r/347852 [14:30:44] (03PS3) 10BBlack: traffic: a/a services switch to codfw-only [puppet] - 10https://gerrit.wikimedia.org/r/347853 [14:31:14] <_joe_> bblack: I guess it's ok to proceed? [14:31:15] 06Operations, 13Patch-For-Review, 15User-Elukey: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850#3189861 (10elukey) Now that I put more thoughts on it, the client-output-buffer will need to cope with the amount of data that will take to rdb100X:YYYY to send a rd... [14:31:47] RECOVERY - Elasticsearch HTTPS on elastic2020 is OK: SSL OK - Certificate elastic2020.codfw.wmnet valid until 2022-04-17 14:30:10 +0000 (expires in 1824 days) [14:32:33] _joe_: yes :) [14:32:57] PROBLEM - Check Varnish expiry mailbox lag on cp1099 is CRITICAL: CRITICAL: expiry mailbox lag is 695052 [14:32:57] RECOVERY - Check systemd state on elastic2020 is OK: OK - running: The system is fully operational [14:32:59] <_joe_> !log starting switchover of services eqiad => codfw; external traffic will be switched over, as well as internal traffic to restbase [14:33:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:46] <_joe_> mobrovac: proceeding, I'm killing internal traffic (non-CP, non-varnish) to restbase in eqiad [14:33:51] !log oblivian: Setting restbase in eqiad DOWN [14:33:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:00] <_joe_> done [14:34:04] <_joe_> :) [14:34:10] k [14:34:13] cool [14:34:17] <_joe_> now I'll switch traffic over, and then move restbase-async to eqiad [14:34:29] ok [14:34:32] <_joe_> (restbase-async being the cp-only host) [14:36:11] (03CR) 10Giuseppe Lavagetto: [C: 032] traffic: a/p services switch to temporary a/a [puppet] - 10https://gerrit.wikimedia.org/r/347852 (owner: 10BBlack) [14:36:37] PROBLEM - Disk space on ms-be1002 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdh1 is not accessible: Input/output error [14:37:27] I'll take a look at that ^ [14:38:08] <_joe_> !log forcing puppet run on caches for catching up with the a/a setting of maps and restbase [14:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:37] RECOVERY - Disk space on ms-be1002 is OK: DISK OK [14:42:30] (03CR) 10Giuseppe Lavagetto: [C: 032] traffic: a/a 
services switch to codfw-only [puppet] - 10https://gerrit.wikimedia.org/r/347853 (owner: 10BBlack) [14:42:49] <_joe_> mobrovac: ^^ this completes the switch of varnish traffic [14:43:01] kk [14:43:52] <_joe_> !log switching traffic for all a/a services plus maps and restbase to codfw-only [14:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:57] gehel: logstash reboots completed, cluster back to green [14:47:54] moritzm: yep, i saw that. Thanks! [14:48:27] PROBLEM - MegaRAID on ms-be1002 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) [14:48:38] ACKNOWLEDGEMENT - MegaRAID on ms-be1002 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T163209 [14:48:41] 06Operations, 10ops-eqiad: Degraded RAID on ms-be1002 - https://phabricator.wikimedia.org/T163209#3189898 (10ops-monitoring-bot) [14:52:21] 06Operations, 10ops-eqiad, 10media-storage: Degraded RAID on ms-be1002 - https://phabricator.wikimedia.org/T163209#3189909 (10Volans) p:05Triage>03Normal [14:54:21] 06Operations, 10ops-eqiad, 10media-storage: Degraded RAID on ms-be1002 - https://phabricator.wikimedia.org/T163209#3189915 (10fgiunchedi) a:03Cmjohnson Confirmed `sdh` isn't well. @Cmjohnson do you have spares onsite? [14:54:26] volans: thanks for fixing the output in ^ [14:54:30] !log oblivian: Setting restbase-async in eqiad UP [14:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:46] !log oblivian: Setting restbase-async in codfw DOWN [14:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:13] <_joe_> !log switchover of services, misc things done [14:55:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:18] PROBLEM - carbon-cache too many creates on graphite1001 is CRITICAL: CRITICAL: 1.69% of data above the critical threshold [1000.0] [14:55:20] 06Operations, 13Patch-For-Review: acpi_pad issues - https://phabricator.wikimedia.org/T162850#3189920 (10MoritzMuehlenhoff) [14:55:20] godog: yw, we need to find another way, swift be are too slow for NRPE timeouts [14:55:22] 06Operations: acpi_pad consuming 100% CPU on tin - https://phabricator.wikimedia.org/T163158#3189918 (10MoritzMuehlenhoff) 05Resolved>03Open The "Improperly owned -0:0- files in /srv/mediawiki-staging" Icinga check was failing on tin, caused by a timeout of completing the check in time. It turns out tin is c... [14:57:14] (03CR) 10ArielGlenn: "Can we link to the meta page about the dumps instead? 
Wikitech might be and has been reorganized from time to time but meta should be pre" [dumps] - 10https://gerrit.wikimedia.org/r/347906 (owner: 10Awight) [14:58:51] (03CR) 10ArielGlenn: "looks good, will merge after tomorrow's deploy of bug fixes" [dumps] - 10https://gerrit.wikimedia.org/r/347908 (owner: 10Awight) [14:59:12] (03CR) 10ArielGlenn: "looks good, will merge after tomorrow's deploy of bugfixes" [dumps] - 10https://gerrit.wikimedia.org/r/347907 (owner: 10Awight) [15:00:16] (03PS2) 10Filippo Giunchedi: traffic: swift temporary a/a [puppet] - 10https://gerrit.wikimedia.org/r/347859 (owner: 10BBlack) [15:00:35] bblack: merging ^ [15:00:59] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] traffic: swift temporary a/a [puppet] - 10https://gerrit.wikimedia.org/r/347859 (owner: 10BBlack) [15:02:11] !log upgrading elastic2020 to elasticsearch 5.1.2 [15:02:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:49] !log puppet-run on cache_upload in codfw/eqiad to pick up switch a/a changes [15:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:19] (03CR) 10ArielGlenn: "Fine to add the existence check but can you explain a bit more the problem you're trying to solve?" [dumps] - 10https://gerrit.wikimedia.org/r/348011 (owner: 10Awight) [15:03:24] of course I mistyped switch/swift [15:05:18] run completed, switching to codfw [15:05:51] (03PS2) 10Filippo Giunchedi: traffic: swift a/p in codfw only [puppet] - 10https://gerrit.wikimedia.org/r/347860 (owner: 10BBlack) [15:06:57] (03CR) 10Filippo Giunchedi: [V: 032 C: 032] traffic: swift a/p in codfw only [puppet] - 10https://gerrit.wikimedia.org/r/347860 (owner: 10BBlack) [15:08:16] !log puppet-run on cache_upload in codfw/eqiad to pick up swift a/p changes [15:08:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:44] puppet run finished, I'm looking at https://grafana.wikimedia.org/dashboard/db/prometheus-cluster-breakdown?orgId=1&from=now-30m&to=now&var-datasource=codfw%20prometheus%2Fops&var-cluster=swift&var-instance=All&refresh=1m [15:11:00] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2841014 keys, up 25 days 22 hours - replication_delay is 46 [15:12:10] and of course https://grafana.wikimedia.org/dashboard/file/swift.json?var-DC=codfw&from=now-3h&to=now-1m&orgId=1 [15:12:51] RECOVERY - Check Varnish expiry mailbox lag on cp1099 is OK: OK: expiry mailbox lag is 453 [15:12:53] the redis recovery is a lie, it will trip again shortly probably [15:12:57] 06Operations, 07HHVM: Frequent TCP RST on connections between HHVM and Redis - https://phabricator.wikimedia.org/T162354#3190006 (10MoritzMuehlenhoff) Still happens on mw1261 with a HHVM build including the fix [15:14:52] (03CR) 10ArielGlenn: "Pending testing, this lgtm. The cost of the second call out to php is negligible compared to everything else. Also I had no idea there was " [dumps] - 10https://gerrit.wikimedia.org/r/348002 (owner: 10Awight) [15:16:40] RECOVERY - Check Varnish expiry mailbox lag on cp1073 is OK: OK: expiry mailbox lag is 4474 [15:17:11] It seems that Redis is not stable [15:18:28] swift eqiad drained, all traffic onto swift codfw [15:19:01] * robh wonders when he should push the new codfw deploy host into service, but it's not now.....
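The "recovery is a lie" remark at 15:12:53 can be verified with a small watch loop; a self-contained sketch, assuming the same redis-cli access as above (the Icinga check derives replication_delay from the replica's INFO output):

    # Poll every 10s: a master_link_status other than "up", or a growing
    # master_last_io_seconds_ago, means the lag is about to come back.
    PASS="$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_6380.conf)"
    while sleep 10; do
      redis-cli -a "$PASS" -p 6479 INFO replication |
        grep -E 'master_link_status|master_last_io_seconds_ago'
    done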
[15:20:38] !log restored default output-buffer config for rdb2005:6479 [15:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:45] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 2 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3190054 (10Papaul) Last update report: - Removed the original disks from the server and put in 2 identical spare disks only difference was the... [15:41:05] <_joe_> robh: are you working on the new deployment host in codfw? [15:41:20] <_joe_> did you verify its redis instance is correctly replicating from tin? [15:42:33] (03PS2) 10Ema: pybal: bind instrumentation TCP port to private addresses [puppet] - 10https://gerrit.wikimedia.org/r/348074 (https://phabricator.wikimedia.org/T103882) [15:42:53] <_joe_> ema: you're not going to merge/deploy that this week, right? [15:43:24] _joe_: nope! :) [15:45:13] 06Operations, 10Monitoring, 10netops: nagios monitor transit/peering links and alert on low/high traffic - https://phabricator.wikimedia.org/T80273#3190110 (10ayounsi) a:03ayounsi Not sure if still relevant seeing how old the ticket is, but: High traffic: already being monitored by LibreNMS Low traffic... [15:45:32] _joe_: its not had its latest merge [15:45:43] I was awaiting the return of those who know the deployment host better [15:45:53] it has a patchset for most of them, let me add you to reviewer [15:46:06] https://gerrit.wikimedia.org/r/#/c/348478/ [15:46:22] I've not done anything on its redis setup. [15:46:22] robh: yeah, I pinged godog about it earlier [15:46:37] I didn't want to half merge things and break it or cause oddness to deploys [15:46:40] so I held off on merge [15:46:57] but its otherwise ready to be pushed into service and checked [15:48:00] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 627 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2844047 keys, up 25 days 23 hours - replication_delay is 627 [15:48:20] 06Operations, 13Patch-For-Review, 15User-Elukey: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850#3190116 (10elukey) ``` elukey@rdb2005:~$ redis-cli -a "$(sudo grep -Po '(?<=masterauth ).*' /etc/redis/tcp_6380.conf)" -p 6479 --bigkeys -i 0.1... [15:49:20] 06Operations, 10Monitoring, 07LDAP, 13Patch-For-Review: allow paging to work properly in ldap - https://phabricator.wikimedia.org/T162745#3190119 (10bd808) To auth, change the test program from T162745#3179299 with this patch and set USER and PASS as approriate: ``` name=add-auth.patch --- paged-ldap.py.o... [15:57:01] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 2841442 keys, up 25 days 23 hours - replication_delay is 0 [15:57:42] (03PS1) 10Andrew Bogott: Shinken: Add monitoring for the labs dhcp server. [puppet] - 10https://gerrit.wikimedia.org/r/348745 (https://phabricator.wikimedia.org/T162956) [15:58:46] (03PS2) 10Andrew Bogott: Shinken: Add monitoring for the labs dhcp server. 
[puppet] - 10https://gerrit.wikimedia.org/r/348745 (https://phabricator.wikimedia.org/T162956) [15:58:56] !log mobrovac@tin Started deploy [restbase/deploy@960b468]: Dev Cluster: Blacklist an enwiki and a commons page [15:59:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:32] !log mobrovac@tin Finished deploy [restbase/deploy@960b468]: Dev Cluster: Blacklist an enwiki and a commons page (duration: 01m 42s) [16:00:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:20] RECOVERY - carbon-cache too many creates on graphite1001 is OK: OK: Less than 1.00% above the threshold [500.0] [16:01:27] !log mobrovac@tin Started deploy [restbase/deploy@960b468]: Blacklist an enwiki and a commons page [16:01:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:41] 06Operations, 10ops-codfw: Degraded RAID on heze - https://phabricator.wikimedia.org/T163087#3190163 (10Volans) [16:03:10] PROBLEM - puppet last run on nescio is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:03:23] 06Operations, 10ops-codfw: Degraded RAID on heze - https://phabricator.wikimedia.org/T163087#3185569 (10Volans) @Dzahn I've updated the output with the result of `sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli` (you can get the right one arriving at `get-` and pressing tab to know which one is avai... [16:07:10] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:07:10] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:07:36] (03CR) 10Chad: "Inline comments. Once this lands, we can clean up the leftovers from the debian package manually from the machine, then clean up the debia" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/348690 (https://phabricator.wikimedia.org/T163185) (owner: 10Muehlenhoff) [16:08:10] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [16:08:45] 06Operations, 10netops: Filter outgoing BGP announcements on AS regex - https://phabricator.wikimedia.org/T83037#908416 (10ayounsi) Slightly different, it is possible to use the configuration statement "remove-private" to achieve a similar goal. It's not as strict as specifying the allowed AS# but might be a g... [16:09:10] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [16:09:36] !log mobrovac@tin Finished deploy [restbase/deploy@960b468]: Blacklist an enwiki and a commons page (duration: 08m 16s) [16:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:10:00] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 617 [16:11:26] (03PS1) 10Cmjohnson: Adding production dns entries for db servers T162233 [dns] - 10https://gerrit.wikimedia.org/r/348750 [16:12:27] !log reboot tin to fix cpu mhz issue and check bios settings - T163158 [16:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:34] T163158: acpi_pad consuming 100% CPU on tin - https://phabricator.wikimedia.org/T163158 [16:13:51] 06Operations, 10Monitoring, 10netops: nagios monitor transit/peering links and alert on low/high traffic - https://phabricator.wikimedia.org/T80273#3190196 (10faidon) Yes, it does. Thanks for working on such an old task! For (1) of your list: - cr1-ulsfo.wikimedia.org:xe-0/0/3.98: this doesn't seem to be co... 
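For RAID alerts like the heze and ms-be1002 ones above, the check can be reproduced by hand. The wrapper path is quoted verbatim from the T163087 comment; the raw MegaCli invocation is an assumption about how the controller is queried (and about the binary being installed as megacli):

    # The NRPE wrapper volans mentions (tab-complete from get- to find it):
    sudo /usr/local/lib/nagios/plugins/get-raid-status-megacli

    # Raw logical-drive view straight from the controller:
    sudo megacli -LDInfo -Lall -aALL | grep -E 'Virtual Drive|State'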
[16:13:51] (03CR) 10Cmjohnson: [C: 032] Adding production dns entries for db servers T162233 [dns] - 10https://gerrit.wikimedia.org/r/348750 (owner: 10Cmjohnson) [16:15:01] !log reimporting some rows to dbstore1002 on jawiki and ruwiki T160509 [16:15:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:10] T160509: run pt-tablechecksum on s6 - https://phabricator.wikimedia.org/T160509 [16:20:00] (03CR) 10BBlack: [C: 04-1] "On hold in general now until post codfw-switchover" [puppet] - 10https://gerrit.wikimedia.org/r/345591 (owner: 10BBlack) [16:20:16] (03PS2) 10BBlack: traffic: route esams via codfw [puppet] - 10https://gerrit.wikimedia.org/r/347613 [16:20:20] PROBLEM - puppet last run on mw1300 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[jobrunner] [16:21:25] !log starting Traffic-layer portions of codfw switchover ( https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Switchover_2 ) [16:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:40] PROBLEM - puppet last run on mw2247 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[jobrunner] [16:22:00] (03CR) 10BBlack: [V: 032 C: 032] traffic: route esams via codfw [puppet] - 10https://gerrit.wikimedia.org/r/347613 (owner: 10BBlack) [16:22:20] PROBLEM - puppet last run on mw1166 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[jobrunner] [16:22:51] I'm guessing mw puppetfails are related to "Adding production dns entries for db servers T162233" ? [16:22:51] T162233: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233 [16:23:10] (03CR) 10Andrew Bogott: [C: 032] Shinken: Add monitoring for the labs dhcp server. [puppet] - 10https://gerrit.wikimedia.org/r/348745 (https://phabricator.wikimedia.org/T162956) (owner: 10Andrew Bogott) [16:23:10] <_joe_> bblack: due to tin rebooting [16:23:11] bblack: no [16:23:15] tin down [16:23:15] (03PS3) 10Andrew Bogott: Shinken: Add monitoring for the labs dhcp server. [puppet] - 10https://gerrit.wikimedia.org/r/348745 (https://phabricator.wikimedia.org/T162956) [16:23:18] nice... :( [16:23:22] what? that should not happen [16:23:30] as in, that should not be deployed [16:23:40] could not get latest version [16:23:46] ok [16:24:00] the mw hosts check the latest version on tin I guess: Connection timed out - connect(2) for "tin.eqiad.wmnet" port 80 [16:24:00] PROBLEM - puppet last run on mw2249 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[jobrunner] [16:24:11] (03PS1) 10Cmjohnson: adding mac address for new db's less db1098--not connecting will add this later T162233 [puppet] - 10https://gerrit.wikimedia.org/r/348755 [16:24:16] this doesn't sound like a good design to me [16:24:50] (03PS2) 10BBlack: traffic: depool eqiad from user traffic [dns] - 10https://gerrit.wikimedia.org/r/347616 [16:24:50] PROBLEM - puppet last run on mw1304 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[jobrunner] [16:25:10] (03CR) 10BBlack: [C: 032] traffic: depool eqiad from user traffic [dns] - 10https://gerrit.wikimedia.org/r/347616 (owner: 10BBlack) [16:25:41] tin should be back up now btw [16:26:00] thanks godog !
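A quick sketch of the failure mode just discussed: the Package[jobrunner] resource fetches over HTTP from the deployment server, so the dependency on tin can be tested directly:

    # Reproduce the 16:24:00 error while tin is rebooting: the package
    # provider times out connecting to tin.eqiad.wmnet port 80.
    curl -sv --connect-timeout 5 -o /dev/null http://tin.eqiad.wmnet/

    # Once tin answers again, a manual agent run clears the failure,
    # which is what produced the recoveries that follow:
    sudo puppet agent --test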
[16:26:20] RECOVERY - puppet last run on mw1166 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [16:26:24] I ran puppet on mw1166 [16:26:31] !log completed Traffic-layer portions of codfw switchover ( https://wikitech.wikimedia.org/wiki/Switch_Datacenter#Switchover_2 ) [16:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:47] np volans ! [16:26:50] RECOVERY - puppet last run on mw1304 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [16:27:28] oh, so puppet fails if tin is down? [16:27:33] in other news, I see ripe-atlas-eqiad down since 3.5h, is it known why? [16:28:03] not known to me [16:31:48] yeah it fails because it tries to connect to tin:80 [16:31:53] (03PS2) 10Muehlenhoff: Move gerrit to using Bouncycastle as packaged by Debian [puppet] - 10https://gerrit.wikimedia.org/r/348690 (https://phabricator.wikimedia.org/T163185) [16:32:03] sounds quite wrong [16:32:10] RECOVERY - puppet last run on nescio is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [16:32:40] probably a cron or service should be started by puppet, but let the failure be handled by icinga [16:37:17] (03PS1) 10Papaul: DHCP: ADD MAC address entries for db20[7-9][0-9] [puppet] - 10https://gerrit.wikimedia.org/r/348758 [16:38:27] (03CR) 10Jcrespo: [C: 04-1] "Typo?" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/348755 (owner: 10Cmjohnson) [16:39:00] PROBLEM - parsoid on wtp1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:39:24] 06Operations, 10netops: Filter outgoing BGP announcements on AS regex - https://phabricator.wikimedia.org/T83037#3190279 (10faidon) Yes, that was my intention as well. The other thing about this that I see on my notes is to `set as-path path 14907` either under the aggregate routes, or under defaults, plus pot... [16:39:50] RECOVERY - parsoid on wtp1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1019 bytes in 0.072 second response time [16:41:30] PROBLEM - parsoid on wtp1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:41:56] (03PS1) 10Filippo Giunchedi: keyholder: create /run/keyholder at boot [puppet] - 10https://gerrit.wikimedia.org/r/348760 [16:42:20] RECOVERY - parsoid on wtp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 1019 bytes in 0.013 second response time [16:42:45] 06Operations, 10Monitoring: Add Icinga check for CPU frequency on Dell R320 - https://phabricator.wikimedia.org/T163220#3190291 (10MoritzMuehlenhoff) [16:43:13] (03PS5) 10Filippo Giunchedi: adds naos to everywhere mira is listed [puppet] - 10https://gerrit.wikimedia.org/r/348478 (owner: 10RobH) [16:44:21] RECOVERY - puppet last run on mw1300 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [16:44:40] RECOVERY - puppet last run on mw2247 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [16:44:49] 06Operations, 13Patch-For-Review, 15User-Elukey: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850#3190325 (10elukey) @cscott Hi! Sorry to bring you in this thread but I am wondering if we could use any of the maintenance scripts listed in https://wikitech.wikimed...
[16:45:14] (03PS1) 10Muehlenhoff: Add symlinks for Debian-packaged Bouncycastle Jars [puppet] - 10https://gerrit.wikimedia.org/r/348762 (https://phabricator.wikimedia.org/T163185) [16:45:45] robh: going to merge your naos patch and check things there are ok [16:48:00] RECOVERY - puppet last run on mw2249 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [16:49:10] (03CR) 10Filippo Giunchedi: [C: 032] adds naos to everywhere mira is listed [puppet] - 10https://gerrit.wikimedia.org/r/348478 (owner: 10RobH) [16:49:10] PROBLEM - parsoid on wtp1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:49:40] PROBLEM - parsoid on wtp1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:50:04] godog: woot \o/ [16:50:19] let me know if anything doesnt go smoothly so next time i know what i missed. [16:50:30] PROBLEM - parsoid on wtp1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:50:40] RECOVERY - parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1019 bytes in 2.541 second response time [16:50:57] for sure [16:51:00] PROBLEM - parsoid on wtp1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:51:10] RECOVERY - parsoid on wtp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 1019 bytes in 5.680 second response time [16:51:20] RECOVERY - parsoid on wtp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 1019 bytes in 0.008 second response time [16:51:26] (03CR) 10Chad: [C: 031] Move gerrit to using Bouncycastle as packaged by Debian [puppet] - 10https://gerrit.wikimedia.org/r/348690 (https://phabricator.wikimedia.org/T163185) (owner: 10Muehlenhoff) [16:51:37] (03CR) 10Chad: [C: 031] Add symlinks for Debian-packaged Bouncycastle Jars [puppet] - 10https://gerrit.wikimedia.org/r/348762 (https://phabricator.wikimedia.org/T163185) (owner: 10Muehlenhoff) [16:51:39] what's happening to parsoid? [16:51:50] RECOVERY - parsoid on wtp1010 is OK: HTTP OK: HTTP/1.1 200 OK - 1019 bytes in 0.063 second response time [16:52:10] PROBLEM - parsoid on wtp1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:52:10] PROBLEM - parsoid on wtp1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:52:10] PROBLEM - parsoid on wtp1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:52:22] (03PS3) 10BBlack: cache::text: switch all mediawiki to codfw [puppet] - 10https://gerrit.wikimedia.org/r/346320 (owner: 10Giuseppe Lavagetto) [16:52:39] _joe_, mobrovac ^^^ [16:52:53] (03CR) 10BBlack: [C: 031] cache::text: switch all mediawiki to codfw [puppet] - 10https://gerrit.wikimedia.org/r/346320 (owner: 10Giuseppe Lavagetto) [16:53:00] RECOVERY - parsoid on wtp1022 is OK: HTTP OK: HTTP/1.1 200 OK - 1019 bytes in 0.477 second response time [16:53:01] RECOVERY - parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1019 bytes in 0.418 second response time [16:53:05] <_joe_> volans: oh it seems this is the effect of changeprop going to eqiad [16:53:10] RECOVERY - parsoid on wtp1013 is OK: HTTP OK: HTTP/1.1 200 OK - 1019 bytes in 6.916 second response time [16:53:10] PROBLEM - parsoid on wtp1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:53:31] <_joe_> so this is not user-related traffic. Still, this is worrying [16:53:44] is codfw more powerful than eqiad for parsoid? [16:53:50] <_joe_> slightly, yes [16:54:10] <_joe_> but we had the same issues on codfw in the past [16:54:26] ok [16:54:32] <_joe_> can someone else debug this? 
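One way to pick up the debugging request above is to reproduce the Icinga probe by hand. A sketch only; port 8000 is an assumption about where Parsoid listens on the wtp hosts:

    # Same shape as the flapping check: GET the service root and report
    # status code and wall time (the Icinga check times out at 10s).
    for host in wtp1008 wtp1013 wtp1018 wtp1021; do
      curl -s -o /dev/null -m 10 \
        -w "$host %{http_code} %{time_total}s\n" \
        "http://$host.eqiad.wmnet:8000/"
    done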
[16:54:49] <_joe_> I'm kinda in the middle of switchover preparations [16:54:59] well it's still going to bog down user-related parsoid traffic right? [16:55:00] RECOVERY - parsoid on wtp1019 is OK: HTTP OK: HTTP/1.1 200 OK - 1019 bytes in 0.042 second response time [16:55:14] <_joe_> bblack: no, why should it? [16:55:20] I have no idea! :) [16:55:30] <_joe_> bblack: user-related traffic is now in codfw [16:55:30] PROBLEM - parsoid on wtp1021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:55:43] damn [16:55:57] those are updates [16:55:59] there's no user-related rb/parsoid traffic in eqiad as some indirect fallout of MW still running there? [16:56:10] PROBLEM - parsoid on wtp1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [16:56:11] the eqiad parsoid cluster is weaker than the codfw one [16:56:20] RECOVERY - parsoid on wtp1021 is OK: HTTP OK: HTTP/1.1 200 OK - 1019 bytes in 0.295 second response time [16:56:29] WARNING: For config property imgInfo, required a value of type: number [16:56:32] Found undefined; Resetting it to: 40000 [16:56:33] and there are others too [16:56:37] are those "expected"? [16:56:52] they are kind of flooding syslog [16:57:01] RECOVERY - parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1019 bytes in 0.012 second response time [16:57:20] PROBLEM - Check Varnish expiry mailbox lag on cp2002 is CRITICAL: CRITICAL: expiry mailbox lag is 764837 [16:57:50] <_joe_> volans: those are parsoid threads starting up [16:57:57] <_joe_> so I guess it's being OOMed [16:58:09] <_joe_> not by the OS, but by its memory limits [16:58:52] yup yup [16:59:10] <_joe_> {"name":"parsoid","hostname":"wtp1021","pid":2,"level":50,"message":"worker 1212 stopped sending heartbeats, [16:59:10] RECOVERY - puppet last run on naos is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [16:59:13] <_joe_> killing.","status":["commonswiki/User:Aschroet/Bla?oldid=241281321","frwiki/Camp_de_concentration_de_Yodok?oldid=136160093","frwiki/9_(album_de_Public_Image_Limited)?oldid=133870074","enwiki/Yoon_Sook-ja?oldid=775216348","enwiki/Niranjan_Bhagat?oldid=773007551","frwiki/Bachia_oxyrhina?oldid=121992102","frwiki/Tramway_de_Most_-_Litvínov?oldid=127880265"],"levelPath":"error/service-runner/master [16:59:19] <_joe_> ","msg":"worker 1212 stopped sending heartbeats, killing.","time":"2017-04-18T16:59:01.280Z","v":0} [16:59:27] worker restarts are spiking [16:59:41] and those are also duplicated under /srv/log/parsoid/syslog.log fwiw [17:00:00] PROBLEM - parsoid on wtp1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:00:31] I guess I should wait a sec before restarting nutcracker :) [17:00:50] RECOVERY - parsoid on wtp1010 is OK: HTTP OK: HTTP/1.1 200 OK - 1019 bytes in 0.606 second response time [17:00:50] RECOVERY - Keyholder SSH agent on naos is OK: OK: Keyholder is armed with all configured keys.
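The naos Keyholder recovery above corresponds to arming the shared ssh-agent on the new deployment host; a sketch using the keyholder wrapper, with the subcommand names assumed from the WMF tooling:

    # "armed with all configured keys" is what the Icinga check asserts:
    sudo keyholder status

    # After a reboot the agent is empty and has to be re-armed by hand,
    # entering the key passphrases when prompted:
    sudo keyholder arm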
[17:00:52] Heap memory limit temporarily exceeded","limit":629145600 [17:01:27] definitely OOMing [17:02:10] PROBLEM - parsoid on wtp1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:03:00] RECOVERY - parsoid on wtp1008 is OK: HTTP OK: HTTP/1.1 200 OK - 1019 bytes in 0.027 second response time [17:03:10] PROBLEM - parsoid on wtp1013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [17:03:30] it hit an edge-case - commonswiki/User:Aschroet/Bla?oldid=241281321 [17:03:31] (03PS4) 10Dzahn: monitoring: add timeout parameter to bad_directory_owner check [puppet] - 10https://gerrit.wikimedia.org/r/348667 [17:03:40] volans: _joe_: ^ [17:04:00] RECOVERY - parsoid on wtp1013 is OK: HTTP OK: HTTP/1.1 200 OK - 1019 bytes in 0.009 second response time [17:04:21] it's a huge gallery [17:04:24] * mobrovac sighs [17:06:55] (03CR) 10Dzahn: [C: 032] monitoring: add timeout parameter to bad_directory_owner check [puppet] - 10https://gerrit.wikimedia.org/r/348667 (owner: 10Dzahn) [17:08:19] mobrovac: can I proceed with my restart of nutcracker in codfw? I'd say yes but I didn't want to mix two things :) [17:08:45] yeah sure elukey, go ahead [17:08:53] thanks :) [17:09:37] !log restart nutcracker in codfw (profile::mediawiki::nutcracker) to make sure that all the daemons are running with the latest config [17:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:54] (03CR) 10Paladox: "This won't be needed in gerrit 2.14. So we could remove bouncy castle all together once we upgrade too." [puppet] - 10https://gerrit.wikimedia.org/r/348690 (https://phabricator.wikimedia.org/T163185) (owner: 10Muehlenhoff) [17:11:56] (03CR) 10Chad: [C: 031] "Yes, I already said that." [puppet] - 10https://gerrit.wikimedia.org/r/348690 (https://phabricator.wikimedia.org/T163185) (owner: 10Muehlenhoff) [17:14:12] (03PS6) 10Dzahn: monitoring: fix wrong parameter bug in file ownership check [puppet] - 10https://gerrit.wikimedia.org/r/348664 [17:17:55] (03CR) 10Dzahn: [C: 032] monitoring: fix wrong parameter bug in file ownership check [puppet] - 10https://gerrit.wikimedia.org/r/348664 (owner: 10Dzahn) [17:19:21] marostegui: thanks for +1 on mysql grant request. should i just merge that and that's it? [17:19:23] !log mobrovac@tin Started deploy [restbase/deploy@1bfada4]: Blacklist all user pages on commons [17:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:33] _joe_: volans: ^ [17:19:46] mobrovac: thanks [17:19:54] !log ssastry@tin Started deploy [parsoid/deploy@b067328]: Deploying Parsoid to bump heap limits to 900m (from 600m) [17:19:54] <_joe_> mobrovac: seems sensible [17:20:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:35] 06Operations, 13Patch-For-Review: acpi_pad issues - https://phabricator.wikimedia.org/T162850#3190433 (10fgiunchedi) [17:21:37] 06Operations: acpi_pad consuming 100% CPU on tin - https://phabricator.wikimedia.org/T163158#3190431 (10fgiunchedi) 05Open>03Resolved tin rebooted, I've enabled HT and fixed performance profile to be "performance per watt (OS)", see also the icinga task for alarming on this and parent task [17:22:45] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:23:55] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
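The 629145600-byte limit in the error above is exactly 600 MiB, i.e. the per-worker heap ceiling that the 17:19 deploy raises to 900m (worker_heap_limit_mb in service-runner terms). A sketch for quantifying the churn from the log path quoted at 16:59:41:

    # 629145600 / 1024 / 1024 = 600, matching the pre-deploy limit.
    # Count heap-limit hits and the worker kills they triggered:
    grep -c 'Heap memory limit temporarily exceeded' /srv/log/parsoid/syslog.log
    grep -c 'stopped sending heartbeats' /srv/log/parsoid/syslog.log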
[17:24:45] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [17:26:19] !log ssastry@tin Finished deploy [parsoid/deploy@b067328]: Deploying Parsoid to bump heap limits to 900m (from 600m) (duration: 06m 25s) [17:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:35] !log mobrovac@tin Finished deploy [restbase/deploy@1bfada4]: Blacklist all user pages on commons (duration: 07m 12s) [17:26:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:55] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [17:29:27] volans, _joe_ codfw parsoid cluster is more powerful than eqiad for sure. avg. load last month on codfw was 17% compared to about ~50% on eqiad before the switch of background processing traffic to codfw. [17:30:02] more workers + (slightly?) more powerful cpus [17:30:16] ok, that makes more sense [17:30:17] mobrovac, deployment finished. fyi. [17:33:34] kk thnx subbu [17:33:41] things are recovering [17:37:44] (03CR) 10Paladox: "Will a downgrade of bouncy castle cause sshd not to work in gerrit, I'm looking at https://github.com/GerritCodeReview/gerrit/blob/482e111" [puppet] - 10https://gerrit.wikimedia.org/r/348690 (https://phabricator.wikimedia.org/T163185) (owner: 10Muehlenhoff) [17:39:41] (03PS1) 10Dzahn: mariadb: clean up duplicate GRANTs for phstats user [puppet] - 10https://gerrit.wikimedia.org/r/348779 [17:40:58] PROBLEM - restbase endpoints health on restbase-dev1003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:41:48] RECOVERY - restbase endpoints health on restbase-dev1003 is OK: All endpoints are healthy [17:42:27] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/#/c/348779/ to clean up duplicates" [puppet] - 10https://gerrit.wikimedia.org/r/348565 (owner: 10Dzahn) [17:43:39] (03CR) 10Dzahn: mariadb: grant user 'phstats' additional select on differential db (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/348565 (owner: 10Dzahn) [17:45:28] PROBLEM - Check Varnish expiry mailbox lag on cp3039 is CRITICAL: CRITICAL: expiry mailbox lag is 671434 [17:46:29] (03PS2) 10Dzahn: mariadb: clean up duplicate GRANTs for phstats user [puppet] - 10https://gerrit.wikimedia.org/r/348779 [17:47:31] (03PS2) 10Dzahn: DHCP: Add MAC address entries for db20[7-9][0-9]. [puppet] - 10https://gerrit.wikimedia.org/r/348758 (owner: 10Papaul) [17:49:08] 06Operations, 10ops-codfw, 13Patch-For-Review: setup naos/WMF6406 as new codfw deployment server - https://phabricator.wikimedia.org/T162900#3178786 (10fgiunchedi) I've merged @RobH patch and ran puppet on naos, issues I've encountered so far: [x] `trebuchet` user/group need to have a specific uid/gid for r...
[17:55:20] (03PS2) 10Cmjohnson: adding mac address for new db's less db1098--not connecting will add this later T162233 [puppet] - 10https://gerrit.wikimedia.org/r/348755 [17:55:57] (03PS3) 10Cmjohnson: adding mac address for new db's less db1098--not connecting will add this later T162233 [puppet] - 10https://gerrit.wikimedia.org/r/348755 [17:57:14] (03CR) 10Paladox: "See https://github.com/GerritCodeReview/gerrit/commit/e2921b62f6c09d574a25aaa079d538ac499ef382" [puppet] - 10https://gerrit.wikimedia.org/r/348690 (https://phabricator.wikimedia.org/T163185) (owner: 10Muehlenhoff) [17:57:41] (03CR) 10Cmjohnson: [C: 032] adding mac address for new db's less db1098--not connecting will add this later T162233 [puppet] - 10https://gerrit.wikimedia.org/r/348755 (owner: 10Cmjohnson) [18:11:01] (03PS1) 10Urbanecm: Remove https://sourcecode.berlin/feed/ from RSS whitelist for mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348782 (https://phabricator.wikimedia.org/T163217) [18:15:34] 06Operations, 10DBA, 10Traffic: dbtree: make wasat a working backend and become active-active - https://phabricator.wikimedia.org/T163141#3187493 (10BBlack) Yeah, leave the traffic tag as we'll want to basically revert https://gerrit.wikimedia.org/r/#/c/348456/ once dbtree is ready for it. [18:19:03] 06Operations, 10DNS, 10Traffic, 06Services (next): icinga alerts on nodejs services when a recdns server is depooled - https://phabricator.wikimedia.org/T162818#3190756 (10BBlack) 05Open>03stalled @Gwicke yeah we should. Regardless, after we're done with the current codfw switchover/switchback, I thin... [18:19:42] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: eqiad rack/setup 11 new DB servers - https://phabricator.wikimedia.org/T162233#3190761 (10Cmjohnson) 10 of the 11 servers that arrived are racked, switch configured, raid completed, idrac setup and dns entries for both mgmt and production. They are rea... [18:24:18] 06Operations, 10Traffic: Implement Varnish-level rough ratelimiting - https://phabricator.wikimedia.org/T163233#3190763 (10BBlack) [18:24:32] 06Operations, 10Traffic: Implement Varnish-level rough ratelimiting - https://phabricator.wikimedia.org/T163233#3190776 (10BBlack) p:05Triage>03Normal [18:25:06] 06Operations, 10Traffic: Implement Varnish-level rough ratelimiting - https://phabricator.wikimedia.org/T163233#3190763 (10BBlack) [18:25:09] 06Operations, 06Discovery, 06Maps, 10Traffic, 03Interactive-Sprint: Rate-limit browsers without referers - https://phabricator.wikimedia.org/T154704#3190777 (10BBlack) [18:26:18] 06Operations, 10Traffic: Implement Varnish-level rough ratelimiting - https://phabricator.wikimedia.org/T163233#3190763 (10BBlack) [18:26:21] 06Operations, 10Traffic: Increase request limits for GETs to /api/rest_v1/ - https://phabricator.wikimedia.org/T118365#3190779 (10BBlack) [18:26:38] (03PS1) 10Cmjohnson: Adding mgmt dns entries for frpm1001 T162298 [dns] - 10https://gerrit.wikimedia.org/r/348783 [18:28:00] 06Operations, 10Traffic, 13Patch-For-Review: Upgrade all cache clusters to Varnish 4 - https://phabricator.wikimedia.org/T131499#3190790 (10BBlack) [18:28:03] 06Operations, 10Traffic, 13Patch-For-Review: Upgrade to Varnish 4: things to remember - https://phabricator.wikimedia.org/T126206#3190786 (10BBlack) 05Open>03Resolved Going back over some of the unchecked boxes at the top: # Ratelimiting and general VCL reload memleak issues can be investigated separat... 
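The "Check Varnish expiry mailbox lag" alerts that keep firing throughout this log compare two Varnish 4 counters; a sketch of reading them directly on a cache host, with the counter names taken from the stock MAIN namespace:

    # Lag is the number of expiry messages mailed to the expiry thread
    # but not yet processed by it: exp_mailed - exp_received.
    varnishstat -1 -f MAIN.exp_mailed -f MAIN.exp_received |
      awk '{v[$1]=$2} END {print v["MAIN.exp_mailed"] - v["MAIN.exp_received"]}'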
[18:30:11] 06Operations, 10Traffic: unix domain socket listening for varnish4 - https://phabricator.wikimedia.org/T138084#3190801 (10BBlack) 05Open>03Resolved a:03BBlack For now we've solved the pragmatic issues in other ways: some general nginx/varnish tuning, kernel TCP params tuning, and using 8x TCP sockets in... [18:44:37] (03CR) 10Cmjohnson: [C: 032] Adding mgmt dns entries for frpm1001 T162298 [dns] - 10https://gerrit.wikimedia.org/r/348783 (owner: 10Cmjohnson) [18:45:23] RECOVERY - Check Varnish expiry mailbox lag on cp3039 is OK: OK: expiry mailbox lag is 203 [18:59:06] (03PS1) 10Dzahn: monitoring:bad_directory_owner: set default timeout 10s [puppet] - 10https://gerrit.wikimedia.org/r/348786 [19:03:06] (03CR) 10Dzahn: [C: 032] monitoring:bad_directory_owner: set default timeout 10s [puppet] - 10https://gerrit.wikimedia.org/r/348786 (owner: 10Dzahn) [19:10:34] (03PS1) 10Rush: tools: bump and consolidate timeout values for client NFS [puppet] - 10https://gerrit.wikimedia.org/r/348788 [19:11:42] (03PS2) 10Rush: tools: bump and consolidate timeout values for client NFS [puppet] - 10https://gerrit.wikimedia.org/r/348788 [19:12:55] (03CR) 10Dzahn: [C: 032] "yea, 18:66:DA is a Dell prefix (not that it really was in question heh. just for the curious, the classic coffer.com page and others are a" [puppet] - 10https://gerrit.wikimedia.org/r/348758 (owner: 10Papaul) [19:13:11] (03PS3) 10Dzahn: DHCP: Add MAC address entries for db20[7-9][0-9]. [puppet] - 10https://gerrit.wikimedia.org/r/348758 (owner: 10Papaul) [19:16:18] (03CR) 10Dzahn: "@papaul puppet ran on install2002 and reloaded DHCP service. you can start installs" [puppet] - 10https://gerrit.wikimedia.org/r/348758 (owner: 10Papaul) [19:16:41] (03CR) 10Andrew Bogott: [C: 031] "This doesn't look harmful to me. It would be better to have NFS be responsive but there's no advantage to killing puppet when it isn't." [puppet] - 10https://gerrit.wikimedia.org/r/348788 (owner: 10Rush) [19:17:17] (03PS3) 10Dzahn: standardize "include ::profile:*", "include ::nrpe" [puppet] - 10https://gerrit.wikimedia.org/r/347023 [19:17:50] (03PS3) 10Rush: tools: bump and consolidate timeout values for client NFS [puppet] - 10https://gerrit.wikimedia.org/r/348788 [19:24:25] (03CR) 10Rush: [C: 032] tools: bump and consolidate timeout values for client NFS [puppet] - 10https://gerrit.wikimedia.org/r/348788 (owner: 10Rush) [19:26:22] 06Operations, 10ops-eqiad: ripe-atlas-eqiad is down - https://phabricator.wikimedia.org/T163243#3191129 (10faidon) [19:29:08] (03CR) 10Dzahn: [C: 032] "already compiled as no-op (minus unrelated compiler issues)" [puppet] - 10https://gerrit.wikimedia.org/r/347023 (owner: 10Dzahn) [19:29:15] (03PS4) 10Dzahn: standardize "include ::profile:*", "include ::nrpe" [puppet] - 10https://gerrit.wikimedia.org/r/347023 [19:34:32] 06Operations, 13Patch-For-Review, 15User-Elukey: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850#3191200 (10cscott) That's probably "as expected" behavior -- the ocg_job_status contains one entry for every cached PDF, and we cache quite a few of them. I'm not c... [19:35:41] 06Operations, 13Patch-For-Review: acpi_pad issues - https://phabricator.wikimedia.org/T162850#3191203 (10Dzahn) All subtasks are resolved. acpid_pad has been unloaded and blacklisted on all `Dell R320` machines. I suggest we try closing it and watch if it ever happens again. If it does not the issue was limi... 
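Most of the "CHECK_NRPE: Socket timeout after 10 seconds" noise in this log is the client-side default that the bad_directory_owner timeout change above works around; the equivalent manual probe with a longer timeout looks like this (host and command name here are illustrative, not taken from the checks config):

    # -t raises the calling side's timeout; the command on the monitored
    # host must still finish within NRPE's own command_timeout.
    /usr/lib/nagios/plugins/check_nrpe -H ms-be1002.eqiad.wmnet \
      -c check_raid -t 30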
[19:36:34] 06Operations, 13Patch-For-Review: acpi_pad issues - https://phabricator.wikimedia.org/T162850#3191209 (10Dzahn) 05Open>03Resolved a:03Dzahn [19:37:40] 06Operations, 13Patch-For-Review, 15User-Elukey: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850#3191223 (10cscott) If you look at `ocg.pdf.job_queue_length` in https://graphite.wikimedia.org/ you'll see that the job queue is actually quite small. The `ocg.pdf.... [19:38:36] 06Operations, 10Monitoring: Add Icinga check for CPU frequency on Dell R320 - https://phabricator.wikimedia.org/T163220#3191225 (10Dzahn) a:03Dzahn [19:39:06] 06Operations, 10Monitoring: Add Icinga check for CPU frequency on Dell R320 - https://phabricator.wikimedia.org/T163220#3190291 (10Dzahn) [19:39:26] 06Operations, 10Monitoring: Add Icinga check for CPU frequency on Dell R320 - https://phabricator.wikimedia.org/T163220#3190291 (10Dzahn) [19:41:36] 06Operations, 13Patch-For-Review: acpi_pad issues - https://phabricator.wikimedia.org/T162850#3191244 (10Dzahn) [19:41:40] 06Operations, 10Monitoring: Add Icinga check for CPU frequency on Dell R320 - https://phabricator.wikimedia.org/T163220#3191243 (10Dzahn) [19:48:00] (03PS1) 10Dzahn: typos: add 'include nrpe' [puppet] - 10https://gerrit.wikimedia.org/r/348794 [19:52:37] 06Operations, 10Traffic, 06Community-Liaisons (Jul-Sep 2017): Communicate this security change to affected editors and other community members - https://phabricator.wikimedia.org/T163251#3191313 (10Whatamidoing-WMF) [19:53:09] (03PS2) 10Dzahn: typos: add 'include nrpe' [puppet] - 10https://gerrit.wikimedia.org/r/348794 [19:54:56] (03CR) 10Framawiki: [C: 031] Remove https://sourcecode.berlin/feed/ from RSS whitelist for mw.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/348782 (https://phabricator.wikimedia.org/T163217) (owner: 10Urbanecm) [19:58:03] (03CR) 10Dzahn: [C: 032] typos: add 'include nrpe' [puppet] - 10https://gerrit.wikimedia.org/r/348794 (owner: 10Dzahn) [19:58:33] (03PS2) 10Dzahn: netboot: fix/adjust partman config for rdb servers [puppet] - 10https://gerrit.wikimedia.org/r/348666 (https://phabricator.wikimedia.org/T140442) [20:13:43] (03PS3) 10Dzahn: netboot: fix/adjust partman config for rdb servers [puppet] - 10https://gerrit.wikimedia.org/r/348666 (https://phabricator.wikimedia.org/T140442) [20:14:41] (03PS4) 10Dzahn: netboot: fix/adjust partman config for rdb servers [puppet] - 10https://gerrit.wikimedia.org/r/348666 (https://phabricator.wikimedia.org/T140442) [20:16:49] (03CR) 10Dzahn: [C: 032] netboot: fix/adjust partman config for rdb servers [puppet] - 10https://gerrit.wikimedia.org/r/348666 (https://phabricator.wikimedia.org/T140442) (owner: 10Dzahn) [20:17:11] (03PS1) 10Andrew Bogott: Keystone: Use keystone-manage token_flush rather than a mysql call [puppet] - 10https://gerrit.wikimedia.org/r/348806 [20:18:26] (03CR) 10jerkins-bot: [V: 04-1] Keystone: Use keystone-manage token_flush rather than a mysql call [puppet] - 10https://gerrit.wikimedia.org/r/348806 (owner: 10Andrew Bogott) [20:19:33] (03PS2) 10Andrew Bogott: Keystone: Use keystone-manage token_flush rather than a mysql call [puppet] - 10https://gerrit.wikimedia.org/r/348806 [20:20:33] 06Operations, 13Patch-For-Review: reinstall rdb100[56] with RAID - https://phabricator.wikimedia.org/T140442#3191435 (10Dzahn) @elukey any hints on what is needed to reinstall one of these? actions that are needed before it's ok taking one down? 
of course this would be AFTER the dcswitch :) [20:20:36] 06Operations, 13Patch-For-Review, 15User-Elukey: JobQueue Redis codfw replicas periodically lagging - https://phabricator.wikimedia.org/T159850#3191436 (10elukey) Thanks a lot for the explanation @cscott! I'll try to add more details. The rdb eqiad shards periodically replicates to codfw via the Redis proto... [20:20:39] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [20:22:39] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [20:23:59] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [20:24:22] (03CR) 10Andrew Bogott: [C: 032] Keystone: Use keystone-manage token_flush rather than a mysql call [puppet] - 10https://gerrit.wikimedia.org/r/348806 (owner: 10Andrew Bogott) [20:26:57] (03PS3) 10Dzahn: Move gerrit to using Bouncycastle as packaged by Debian [puppet] - 10https://gerrit.wikimedia.org/r/348690 (https://phabricator.wikimedia.org/T163185) (owner: 10Muehlenhoff) [20:28:39] (03CR) 10Dzahn: [C: 032] "as stated in comments above, gerrit will not start using the package until symlinks will be added later, so going ahead" [puppet] - 10https://gerrit.wikimedia.org/r/348690 (https://phabricator.wikimedia.org/T163185) (owner: 10Muehlenhoff) [20:28:45] (03CR) 10Paladox: [C: 031] "Tested on gerrit-test3 and saw no breakages. Tested ssh too." [puppet] - 10https://gerrit.wikimedia.org/r/348690 (https://phabricator.wikimedia.org/T163185) (owner: 10Muehlenhoff) [20:31:49] (03CR) 10Dzahn: "Notice: /Stage[main]/Packages::Libbcpkix_java/Package[libbcpkix-java]/ensure: ensure changed 'purged' to 'present'" [puppet] - 10https://gerrit.wikimedia.org/r/348690 (https://phabricator.wikimedia.org/T163185) (owner: 10Muehlenhoff) [20:36:47] !log bblack@neodymium conftool action : set/pooled=no; selector: name=cp2002.codfw.wmnet,service=varnish-be [20:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:09] RECOVERY - Check Varnish expiry mailbox lag on cp2002 is OK: OK: expiry mailbox lag is 301531 [20:49:40] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] [20:50:39] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [20:50:59] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [21:12:23] !log bblack@neodymium conftool action : set/pooled=yes; selector: name=cp2002.codfw.wmnet,service=varnish-be [21:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:30] (03CR) 10Krinkle: [C: 031] Remove defunct $wgForeignUploadTestEnabled for cross-wiki upload A/B test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/347412 (owner: 10Bartosz Dziewoński) [21:31:06] (03PS1) 10Chad: Removing bouncycastle libraries, installing from debian packages instead [debs/gerrit] - 10https://gerrit.wikimedia.org/r/348857 [21:32:12] (03CR) 10Chad: "Actually, needs debian changelog entry too so we can rebuild" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/348857 (owner: 10Chad) [21:33:15] (03PS1) 10Andrew Bogott: Pipe keystone-manage cron output to /dev/null [puppet] - 10https://gerrit.wikimedia.org/r/348858 [21:34:05] (03PS2) 10Andrew Bogott: Keystone: Redirect keystone-manage cron output 
to /dev/null [puppet] - 10https://gerrit.wikimedia.org/r/348858 [21:34:23] (03PS2) 10Chad: Removing bouncycastle libraries, installing from debian packages instead [debs/gerrit] - 10https://gerrit.wikimedia.org/r/348857 [21:35:35] (03CR) 10Andrew Bogott: [C: 032] Keystone: Redirect keystone-manage cron output to /dev/null [puppet] - 10https://gerrit.wikimedia.org/r/348858 (owner: 10Andrew Bogott) [21:38:39] PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 109, down: 1, dormant: 0, excluded: 2, unused: 0BRge-2/0/14: down - frdb1002BR [21:39:11] 06Operations, 10ops-ulsfo, 10fundraising-tech-ops, 13Patch-For-Review: rack/setup frbackup2001 - https://phabricator.wikimedia.org/T162469#3191820 (10ayounsi) [21:41:35] 06Operations, 10ops-eqiad, 10fundraising-tech-ops: move frdb1002 from pfw1 to pfw2 - https://phabricator.wikimedia.org/T163268#3191829 (10Jgreen) [21:46:53] 06Operations, 10MediaWiki-General-or-Unknown, 06Release-Engineering-Team, 10Traffic, and 5 others: Make sure we're not relying on HTTP_PROXY headers - https://phabricator.wikimedia.org/T140658#3191871 (10Krinkle) 05Open>03Resolved a:03Krinkle Landed in master for 1.28 ^^ is that known? [21:47:39] PROBLEM - Router interfaces on pfw-eqiad is CRITICAL: CRITICAL: host 208.80.154.218, interfaces up: 109, down: 1, dormant: 0, excluded: 2, unused: 0BRge-2/0/14: down - frdb1002BR [21:50:13] (03PS1) 10Andrew Bogott: Keystone: Kill off novaobserver and novaadmin tokens after 2+ hours. [puppet] - 10https://gerrit.wikimedia.org/r/348862 (https://phabricator.wikimedia.org/T163259) [21:51:05] (03CR) 10jerkins-bot: [V: 04-1] Keystone: Kill off novaobserver and novaadmin tokens after 2+ hours. [puppet] - 10https://gerrit.wikimedia.org/r/348862 (https://phabricator.wikimedia.org/T163259) (owner: 10Andrew Bogott) [21:53:56] (03PS2) 10Andrew Bogott: Keystone: Kill off novaobserver and novaadmin tokens after 2+ hours. [puppet] - 10https://gerrit.wikimedia.org/r/348862 (https://phabricator.wikimedia.org/T163259) [21:54:40] 06Operations, 10MediaWiki-General-or-Unknown, 06Release-Engineering-Team, 10Traffic, and 5 others: Make sure we're not relying on HTTP_PROXY headers - https://phabricator.wikimedia.org/T140658#3191909 (10demon) There's probably some extensions that need fixing here too. The Elastica library looks possibly... [21:55:25] (03PS3) 10Andrew Bogott: Keystone: Kill off novaobserver and novaadmin tokens after 2+ hours. [puppet] - 10https://gerrit.wikimedia.org/r/348862 (https://phabricator.wikimedia.org/T163259) [21:55:26] 06Operations, 10MediaWiki-General-or-Unknown, 06Release-Engineering-Team, 10Traffic, and 5 others: Make sure we're not relying on HTTP_PROXY headers - https://phabricator.wikimedia.org/T140658#3191910 (10demon) 05Resolved>03Open [22:01:36] (03PS4) 10Andrew Bogott: Keystone: Kill off novaobserver and novaadmin tokens after 2+ hours. [puppet] - 10https://gerrit.wikimedia.org/r/348862 (https://phabricator.wikimedia.org/T163259) [22:05:22] 06Operations, 10MediaWiki-General-or-Unknown, 06Release-Engineering-Team, 10Traffic, and 5 others: Make sure we're not relying on HTTP_PROXY headers - https://phabricator.wikimedia.org/T140658#3191940 (10demon) 05Open>03Resolved >>! In T140658#3191909, @demon wrote: > There's probably some extensions t...
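The Keystone patches above replace a raw mysql DELETE with the native expiry tool; a sketch of the cron payload (token_flush is a stock keystone-manage subcommand, and the redirect matches the 21:34 follow-up patch):

    # Drop expired tokens from the keystone database; output is silenced
    # so the cron does not mail on every run.
    sudo keystone-manage token_flush > /dev/null 2>&1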
[22:05:35] 06Operations, 10MediaWiki-General-or-Unknown, 06Release-Engineering-Team, 10Traffic, and 5 others: Make sure we're not relying on HTTP_PROXY headers - https://phabricator.wikimedia.org/T140658#3191943 (10demon) [22:12:29] PROBLEM - Disk space on ocg1003 is CRITICAL: DISK CRITICAL - free space: / 1577 MB (3% inode=84%) [22:18:39] PROBLEM - Host lvs2001 is DOWN: PING CRITICAL - Packet loss = 100% [22:18:59] RECOVERY - Host lvs2001 is UP: PING WARNING - Packet loss = 73%, RTA = 36.09 ms [22:21:30] (03CR) 10Paladox: [C: 031] Removing bouncycastle libraries, installing from debian packages instead [debs/gerrit] - 10https://gerrit.wikimedia.org/r/348857 (owner: 10Chad) [22:22:29] RECOVERY - Disk space on ocg1003 is OK: DISK OK [22:30:15] !log ocg1003 gzipping ocg.log for disk space [22:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:27] 06Operations, 06Performance-Team, 15User-fgiunchedi: Backfill restored coal whisper files with current data - https://phabricator.wikimedia.org/T163194#3192065 (10Krinkle) p:05Triage>03High a:03Krinkle [22:39:19] PROBLEM - puppet last run on cp3041 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:46:25] (03PS1) 10Dzahn: dumps: skip cert monitoring where Letsencrypt is disabled [puppet] - 10https://gerrit.wikimedia.org/r/348869 [22:47:39] (03CR) 10jerkins-bot: [V: 04-1] dumps: skip cert monitoring where Letsencrypt is disabled [puppet] - 10https://gerrit.wikimedia.org/r/348869 (owner: 10Dzahn) [22:49:13] (03PS2) 10Dzahn: dumps: skip cert monitoring where Letsencrypt is disabled [puppet] - 10https://gerrit.wikimedia.org/r/348869 [22:52:42] (03CR) 10Dzahn: [C: 032] dumps: skip cert monitoring where Letsencrypt is disabled [puppet] - 10https://gerrit.wikimedia.org/r/348869 (owner: 10Dzahn) [22:52:48] (03PS3) 10Dzahn: dumps: skip cert monitoring where Letsencrypt is disabled [puppet] - 10https://gerrit.wikimedia.org/r/348869 [22:56:18] (03CR) 10RobH: [C: 031] dumps: skip cert monitoring where Letsencrypt is disabled [puppet] - 10https://gerrit.wikimedia.org/r/348869 (owner: 10Dzahn) [22:56:53] 06Operations: Four different PHP/HHVM versions on the cluster - https://phabricator.wikimedia.org/T163278#3192140 (10Catrope) [22:57:19] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 22 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [22:59:16] 06Operations, 10Traffic: Select or Acquire Address Space for Asia Cache DC - https://phabricator.wikimedia.org/T156256#3192167 (10DFoy) @BBlack Is there an update for ETA for updating the zero whitelisting IPs? I will need several months of lead time before this goes live to get ~50 partners worldwide to upda... [23:00:20] 06Operations, 06Performance-Team, 15User-fgiunchedi: Backfill restored coal whisper files with current data - https://phabricator.wikimedia.org/T163194#3192168 (10Krinkle) I tried out `whisper-fill` in my home directory on a copy of the `coal.loadEventEnd` metric and copied it to graphite1001 as `coal.merged... 
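For the ocg1003 root-filesystem alert above, the 22:30 !log amounts to compressing the big log in place; a sketch, with the ocg.log path an assumption:

    # Find what is eating / before touching anything:
    sudo du -xh --max-depth=2 / 2>/dev/null | sort -rh | head -n 20

    # Then reclaim space without throwing the log contents away:
    sudo gzip /var/log/ocg/ocg.log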
[23:01:47] PROBLEM - cassandra-a CQL 10.64.48.98:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.98 and port 9042: Connection refused [23:01:47] PROBLEM - cassandra-c CQL 10.64.48.100:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.100 and port 9042: Connection refused [23:01:57] PROBLEM - MD RAID on restbase1018 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 [23:01:58] ACKNOWLEDGEMENT - MD RAID on restbase1018 is CRITICAL: CRITICAL: State: degraded, Active: 11, Working: 11, Failed: 1, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T163280 [23:01:58] PROBLEM - cassandra-b SSL 10.64.48.99:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [23:01:58] PROBLEM - cassandra-c SSL 10.64.48.100:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [23:02:02] !log ms1001 - deleting old GlobalCert SSL cert for dumps.wm that was about to expire and is replaced by Letsencrypt, [23:02:07] 06Operations, 10ops-eqiad: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T163280#3192172 (10ops-monitoring-bot) [23:02:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:17] PROBLEM - cassandra-b service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [23:02:17] PROBLEM - cassandra-c service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [23:02:18] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw is OK: OK - failed 17 probes of 278 (alerts on 19) - https://atlas.ripe.net/measurements/1791212/#!map [23:02:47] PROBLEM - cassandra-a SSL 10.64.48.98:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [23:02:47] PROBLEM - cassandra-a service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [23:02:47] PROBLEM - cassandra-b CQL 10.64.48.99:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.99 and port 9042: Connection refused [23:02:47] heh, well the monitoring bot already tells us [23:02:56] now it just needs to auto-ACK all those too :p [23:03:08] mutante: eheheh [23:04:07] ACKNOWLEDGEMENT - cassandra-a CQL 10.64.48.98:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.98 and port 9042: Connection refused daniel_zahn https://phabricator.wikimedia.org/T163280 [23:04:07] ACKNOWLEDGEMENT - cassandra-a SSL 10.64.48.98:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused daniel_zahn https://phabricator.wikimedia.org/T163280 [23:04:07] ACKNOWLEDGEMENT - cassandra-a service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed daniel_zahn https://phabricator.wikimedia.org/T163280 [23:04:07] ACKNOWLEDGEMENT - cassandra-b CQL 10.64.48.99:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.99 and port 9042: Connection refused daniel_zahn https://phabricator.wikimedia.org/T163280 [23:04:07] ACKNOWLEDGEMENT - cassandra-b SSL 10.64.48.99:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused daniel_zahn https://phabricator.wikimedia.org/T163280 [23:04:07] ACKNOWLEDGEMENT - cassandra-b service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed daniel_zahn https://phabricator.wikimedia.org/T163280 [23:04:07] ACKNOWLEDGEMENT 
- cassandra-c CQL 10.64.48.100:9042 on restbase1018 is CRITICAL: connect to address 10.64.48.100 and port 9042: Connection refused daniel_zahn https://phabricator.wikimedia.org/T163280 [23:04:08] ACKNOWLEDGEMENT - cassandra-c SSL 10.64.48.100:7001 on restbase1018 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused daniel_zahn https://phabricator.wikimedia.org/T163280 [23:04:08] ACKNOWLEDGEMENT - cassandra-c service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed daniel_zahn https://phabricator.wikimedia.org/T163280 [23:04:09] ACKNOWLEDGEMENT - puppet last run on restbase1018 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[restbase/deploy] daniel_zahn https://phabricator.wikimedia.org/T163280 [23:04:59] !log dzahn@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1018.eqiad.wmnet [23:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:45] 06Operations: Four different PHP/HHVM versions on the cluster - https://phabricator.wikimedia.org/T163278#3192193 (10Catrope) This seems to be partially expected. {T158176} says "3.18.2 is running on the mediawiki canaries, but the wider rollout is held back until after the DC switchover", which seems sensible. [23:06:07] PROBLEM - Check systemd state on restbase1018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [23:06:54] ACKNOWLEDGEMENT - Check systemd state on restbase1018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T163280 [23:07:12] 06Operations, 10ops-eqiad: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T163280#3192172 (10Dzahn) depooled - 16:07 <+logmsgbot> !log dzahn@puppetmaster1001 conftool action : set/pooled=no; selector: name=restbase1018.eqiad.wmnet [23:08:27] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [23:16:03] 06Operations, 10Traffic: Select or Acquire Address Space for Asia Cache DC - https://phabricator.wikimedia.org/T156256#3192246 (10BBlack) We still have no real ETA on the IP addresses. We're attempting to acquire the address space from APNIC. They're (reasonably) requiring proof of our needs, which includes... [23:19:35] 06Operations: Four different PHP/HHVM versions on the cluster - https://phabricator.wikimedia.org/T163278#3192140 (10Dzahn) The easiest start here would be to upgrade mwdebug1001/1002 to 3.18.2. That seems to make sense when some real appservers are already on it. They should probably always be updated first be... [23:21:19] 06Operations: Four different PHP/HHVM versions on the cluster - https://phabricator.wikimedia.org/T163278#3192272 (10Dzahn) On any other day i would probably just do that since they are debug hosts. Though right now might be a bad moment. The "is running on the canaries" should cover mwdebug* though, shouldn't it. [23:23:33] 06Operations: Four different PHP/HHVM versions on the cluster - https://phabricator.wikimedia.org/T163278#3192140 (10demon) >>! In T163278#3192272, @Dzahn wrote: > On any other day i would probably just do that since they are debug hosts. Though right now might be a bad moment. Possibly, but it should also be p... 
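The restbase1018 depool above is a conftool action; expressed as confctl commands run from the puppetmaster, with the selector syntax matching the !log line:

    # Take the failing node out of its pools while T163280 is worked on...
    sudo confctl select 'name=restbase1018.eqiad.wmnet' set/pooled=no

    # ...and put it back once the Cassandra instances are healthy again:
    sudo confctl select 'name=restbase1018.eqiad.wmnet' set/pooled=yes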
[23:08:27] RECOVERY - puppet last run on cp3041 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures
[23:16:03] 06Operations, 10Traffic: Select or Acquire Address Space for Asia Cache DC - https://phabricator.wikimedia.org/T156256#3192246 (10BBlack) We still have no real ETA on the IP addresses. We're attempting to acquire the address space from APNIC. They're (reasonably) requiring proof of our needs, which includes...
[23:19:35] 06Operations: Four different PHP/HHVM versions on the cluster - https://phabricator.wikimedia.org/T163278#3192140 (10Dzahn) The easiest start here would be to upgrade mwdebug1001/1002 to 3.18.2. That seems to make sense when some real appservers are already on it. They should probably always be updated first be...
[23:21:19] 06Operations: Four different PHP/HHVM versions on the cluster - https://phabricator.wikimedia.org/T163278#3192272 (10Dzahn) On any other day i would probably just do that since they are debug hosts. Though right now might be a bad moment. The "is running on the canaries" should cover mwdebug* though, shouldn't it?
[23:23:33] 06Operations: Four different PHP/HHVM versions on the cluster - https://phabricator.wikimedia.org/T163278#3192140 (10demon) >>! In T163278#3192272, @Dzahn wrote: > On any other day i would probably just do that since they are debug hosts. Though right now might be a bad moment. Possibly, but it should also be p...
[23:25:05] 06Operations, 10Monitoring: Tegmen: process spawn loop + failed icinga + failing puppet - https://phabricator.wikimedia.org/T163286#3192276 (10Volans)
[23:25:41] 06Operations, 10Collection, 10OfflineContentGenerator, 10Reading-Community-Engagement, and 2 others: Replace OCG in collection extension with Electron - https://phabricator.wikimedia.org/T150872#3192290 (10ovasileva)
[23:28:44] 06Operations, 10Monitoring: Tegmen: process spawn loop + failed icinga + failing puppet - https://phabricator.wikimedia.org/T163286#3192308 (10Volans) Could it be that the crontab that runs every 10 minutes had a race with a puppet run and made all this mess... I don't see it wrapped in a `run-no-puppet`: ```...
[23:32:07] RECOVERY - Check systemd state on restbase1018 is OK: OK - running: The system is fully operational
[23:32:17] RECOVERY - cassandra-b service on restbase1018 is OK: OK - cassandra-b is active
[23:32:17] RECOVERY - cassandra-c service on restbase1018 is OK: OK - cassandra-c is active
[23:32:59] oh really
[23:33:11] self-healing is always appreciated
[23:35:07] PROBLEM - Check systemd state on restbase1018 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:35:11] 06Operations, 10ops-eqiad: Degraded RAID on restbase1018 - https://phabricator.wikimedia.org/T163280#3192333 (10Dzahn) 16:05 < icinga-wm> PROBLEM - cassandra-b service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed 16:05 < icinga-wm> PROBLEM - cassandra-c service on res...
[23:35:17] PROBLEM - cassandra-b service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed
[23:35:17] PROBLEM - cassandra-c service on restbase1018 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed
[23:36:59] mutante spoke too soon ^^
[23:38:36] 06Operations: acpi_pad consuming 100% CPU on tin - https://phabricator.wikimedia.org/T163158#3192340 (10Dzahn) >>! In T163158#3189918, @MoritzMuehlenhoff wrote: > The "Improperly owned -0:0- files in /srv/mediawiki-staging" Icinga check was failing on tin, caused by a timeout of completing the check in time. T...
[23:41:24] 06Operations, 06Discovery, 10Monitoring, 10Wikidata, and 3 others: Create response time monitoring for WDQS endpoint - https://phabricator.wikimedia.org/T119915#3192345 (10Dzahn) 05Resolved>03Open
[23:43:08] 06Operations, 06Discovery, 10Monitoring, 10Wikidata, and 3 others: Create response time monitoring for WDQS endpoint - https://phabricator.wikimedia.org/T119915#1839912 (10Dzahn) The Icinga/graphite check "Response time for WDQS" is in status "UNKNOWN" because there are "No valid datapoints found". https...
[23:43:43] paladox: yea, it's trolling
[23:43:48] ok
[23:43:56] s/trolling/flapping :)
[23:45:16] 06Operations, 10Monitoring: Tegmen: process spawn loop + failed icinga + failing puppet - https://phabricator.wikimedia.org/T163286#3192354 (10Volans) Also, why do we do the stop/sync/start all the time instead of just syncing the files to a safe location and having a script `make-icinga-primary` or similar that do...
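The race Volans describes at 23:28:44 is a cron job and a puppet agent run manipulating the same icinga state concurrently. A minimal illustration of the `run-no-puppet` wrapping he suggests — the script name and paths here are invented for illustration, not the actual tegmen configuration:

```
# Hypothetical sync job: the cron entry would invoke the command through the
# run-no-puppet helper so it never overlaps a puppet agent run, e.g.:
#   */10 * * * * root /usr/local/sbin/run-no-puppet /usr/local/bin/sync-icinga-state
# Run by hand, the wrapped form is simply:
/usr/local/sbin/run-no-puppet /usr/local/bin/sync-icinga-state
```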
[23:47:14] 06Operations, 10ops-codfw, 13Patch-For-Review: setup naos/WMF6406 as new codfw deployment server - https://phabricator.wikimedia.org/T162900#3192366 (10Dzahn)
[23:52:44] mutante: uid fixes for those should be easy
[23:54:52] RainbowSprinkles: for naos? yea, i know, i have done it for mira before. this one godog already did
[23:55:15] i'll just copy home dir data in addition
[23:55:31] Yeah, manually easily done, thinking of a perma fix tho
[23:55:39] Still shouldn't be hard
[23:58:32] (03PS5) 10Andrew Bogott: Keystone: Kill off novaobserver and novaadmin tokens after 2+ hours. [puppet] - 10https://gerrit.wikimedia.org/r/348862 (https://phabricator.wikimedia.org/T163259)
[23:59:24] RainbowSprinkles: yea, would be nice, just too many things to fix and kind of always been a problem. user does have "uid" parameter
[23:59:44] Oh, it does? Wonder why naos didn't get it
[23:59:44] and that wikitech page is supposed to be the place to define which is right
[23:59:58] i mean the puppet type has it, in general
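The uid fixes discussed here concern accounts created with a different numeric uid on the new host than they have elsewhere. A hedged sketch of the one-off manual fix — username and uids are invented for illustration; the durable fix, per the 23:59:58 remark, is pinning the uid on puppet's user type:

```
# Hypothetical: 'deployer' got uid 2001 on naos but is 1001 on the other hosts.
usermod -u 1001 deployer   # also re-owns files under the home directory
# Re-own files outside the home dir that still carry the old uid:
find /srv -xdev -uid 2001 -exec chown deployer {} +
# The permanent fix in puppet would be along the lines of:
#   user { 'deployer': uid => 1001 }
```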