[00:17:04] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[01:09:45] PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: puppet fail
[01:36:42] RECOVERY - puppet last run on cp3037 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[01:50:13] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[02:04:19] !log l10nupdate@tin LocalisationUpdate failed (1.28.0-wmf.6) at 2016-06-21 02:04:19+00:00
[02:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:10:55] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Jun 21 02:10:55 UTC 2016 (duration 6m 36s)
[02:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:02:50] 06Operations, 06Release-Engineering-Team, 07Developer-notice, 05Gitblit-Deprecate, and 2 others: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2395547 (10Dzahn) @greg i was just going to do it during Wikimania.....
[03:09:47] 06Operations, 10DBA: m1-master switch from db1001 to db1016 - https://phabricator.wikimedia.org/T106312#2395548 (10Dzahn) I can check on RT ('rt') and racktables (the other rt) on Wednesday, maybe around 18UTC but i dont worry about it since these services are just used by ops themselves. That said, please no...
[03:11:39] 06Operations, 07Blocked-on-RelEng, 05Gitblit-Deprecate, 13Patch-For-Review: Phase out antimony.wikimedia.org (git.wikimedia.org / gitblit) - https://phabricator.wikimedia.org/T123718#2395549 (10Dzahn) @Paladox this is about decom'ing antimony after gitblit is gone. slightly different. but i will take it
[03:11:48] 06Operations, 07Blocked-on-RelEng, 05Gitblit-Deprecate, 13Patch-For-Review: Phase out antimony.wikimedia.org (git.wikimedia.org / gitblit) - https://phabricator.wikimedia.org/T123718#2395550 (10Dzahn) a:03Dzahn
[03:19:30] PROBLEM - Disk space on elastic1024 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80224 MB (15% inode=99%)
[03:25:50] RECOVERY - Disk space on elastic1024 is OK: DISK OK
[03:30:18] 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 2 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2395597 (10Dzahn) 05Open>03stalled
[03:55:27] (03CR) 10Dzahn: Restart exim daily on Monday to Friday (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294929 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[03:57:34] (03CR) 10Dzahn: "20after4, Chad, do you agree with this?" [puppet] - 10https://gerrit.wikimedia.org/r/295011 (owner: 10Paladox)
[04:00:46] (03CR) 10Dzahn: "i see just one merged commit but on "stable" not "wmf/stable"?" [puppet] - 10https://gerrit.wikimedia.org/r/293818 (owner: 1020after4)
[04:07:24] 06Operations, 07Puppet, 13Patch-For-Review: Reconsider the aligning arrows puppet lint - https://phabricator.wikimedia.org/T137763#2378054 (10Dzahn) Please keep this check. I have fixed ALL of these across the entire repo before letting it vote. That was a lot of work and it has already been done.
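(For context on the recurring CirrusSearch latency alerts above: the check compares the fraction of recent graphite datapoints against a threshold. A minimal shell sketch of that logic, using the standard graphite render API and jq; the metric path below is a placeholder, not the exact one Icinga queries.)

```sh
#!/bin/bash
# Sketch: percentage of datapoints above a critical threshold, as in the
# "20.00% of data above the critical threshold [1000.0]" alerts.
GRAPHITE=https://graphite.wikimedia.org
METRIC='MediaWiki.CirrusSearch.codfw.requestTime.p95'  # hypothetical path
THRESHOLD=1000

curl -s "${GRAPHITE}/render?target=${METRIC}&from=-10min&format=json" |
  jq --argjson t "$THRESHOLD" '
    .[0].datapoints
    | map(select(.[0] != null))                       # drop empty buckets
    | (map(select(.[0] > $t)) | length) / length * 100
  '  # prints the percentage of points above the threshold
```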
[04:19:59] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[04:20:11] 06Operations, 10Mail, 10OTRS: E-mail incorrectly forwarded to wm-cz OTRS e-mail - https://phabricator.wikimedia.org/T129743#2114650 (10Dzahn) What is the proposed change here? wikimedia.cz is not controlled by WMF but by WMCZ http://www.nic.cz/whois/?d=wikimedia.cz
[04:21:46] 06Operations, 06Labs, 10Labs-Infrastructure, 06Reading-Web-Backlog, and 2 others: https://wikitech.m.wikimedia.org/ serves wikimedia.org portal - https://phabricator.wikimedia.org/T120527#2395633 (10Dzahn)
[04:27:36] (03PS1) 10Dzahn: nagios_common: delete check_http_bits command [puppet] - 10https://gerrit.wikimedia.org/r/295321 (https://phabricator.wikimedia.org/T107430)
[04:33:42] 06Operations, 13Patch-For-Review: Remove secure.wikimedia.org - https://phabricator.wikimedia.org/T120790#1861112 (10Dzahn) I guess we should call it declined then...
[04:47:14] (03PS1) 10Dzahn: drac,icinga,ipmi: do not ensure => latest [puppet] - 10https://gerrit.wikimedia.org/r/295322 (https://phabricator.wikimedia.org/T115348)
[04:47:57] (03CR) 10Dzahn: "comments on the tickets sound like we are leaning towards declining this change and want to keep the redirects indefinitely.." [puppet] - 10https://gerrit.wikimedia.org/r/257510 (https://phabricator.wikimedia.org/T120790) (owner: 10Reedy)
[05:07:55] 06Operations, 10Wikimedia-Mailing-lists: Please reset password of hackathonorganizers mailing list - https://phabricator.wikimedia.org/T137873#2382117 (10Dzahn) I ran the /var/lib/mailman/bin/change_pw command with -l hackathonorganizers and _without_ the "quiet" option, which means you should have received em...
[05:08:12] 06Operations, 10Wikimedia-Mailing-lists: Please reset password of hackathonorganizers mailing list - https://phabricator.wikimedia.org/T137873#2395651 (10Dzahn) 05Open>03Resolved a:03Dzahn
[05:16:06] 06Operations, 10ops-esams: Move cp3030+ from OE14 to OE13 in racktables - https://phabricator.wikimedia.org/T136403#2395653 (10Dzahn)
[05:34:31] 06Operations, 06Labs, 10Labs-Infrastructure: labs precise and jessie instance not accessible after provisioning - https://phabricator.wikimedia.org/T117673#2395660 (10Dzahn)
[05:35:52] 06Operations: fix up log retention on log collection/storage hosts - https://phabricator.wikimedia.org/T92839#2395662 (10Dzahn)
[05:38:37] 06Operations: fix up log retention on log collection/storage hosts - https://phabricator.wikimedia.org/T92839#1121529 (10Dzahn) also T87792 and T84618 and T114395
[05:41:30] 06Operations, 07SEO: GWT accounts - https://phabricator.wikimedia.org/T103567#2395678 (10Dzahn) 05Open>03Resolved
[05:41:55] 06Operations, 07Privacy, 07audits-data-retention: Gerrit seemingly violates data retention guidelines - https://phabricator.wikimedia.org/T114395#2395679 (10Dzahn) a:03Dzahn
[05:42:07] 06Operations, 07Privacy, 07audits-data-retention: Gerrit seemingly violates data retention guidelines - https://phabricator.wikimedia.org/T114395#1694145 (10Dzahn) p:05Normal>03High
[05:42:30] 06Operations, 06Release-Engineering-Team, 07Privacy, 07audits-data-retention: Gerrit seemingly violates data retention guidelines - https://phabricator.wikimedia.org/T114395#1694145 (10Dzahn)
[05:43:51] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: Grafana login issue for @thiemowmde - https://phabricator.wikimedia.org/T135994#2395683 (10Dzahn)
[05:44:02] 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: Grafana login issue for @thiemowmde - https://phabricator.wikimedia.org/T135994#2318389 (10Dzahn) @Robh
[05:46:08] 06Operations, 07Graphite: Grafana login issue for @thiemowmde - https://phabricator.wikimedia.org/T135994#2395686 (10Dzahn)
[05:48:01] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[05:48:58] 06Operations, 10ops-codfw, 10hardware-requests: procure syslog hardware in codfw - https://phabricator.wikimedia.org/T138075#2395687 (10Dzahn)
[05:52:33] 06Operations, 10vm-requests: eqiad/codfw: 1 VM request for prometheus - https://phabricator.wikimedia.org/T136313#2395689 (10Dzahn) a:03Dzahn
[05:58:12] 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Privacy, 07audits-data-retention: Gerrit seemingly violates data retention guidelines - https://phabricator.wikimedia.org/T114395#2395690 (10greg)
[06:10:44] 06Operations, 10Traffic, 06Community-Liaisons (Jul-Sep-2016): Help contact bot owners about the end of HTTP access to the API - https://phabricator.wikimedia.org/T136674#2343854 (10Dzahn) Brandon/Sherry asked me to contact user Paulis for his bot Fkraus because he speaks German. I mailed him in German about it.
[06:17:41] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[06:19:24] 06Operations, 10Mail, 10OTRS: E-mail incorrectly forwarded to wm-cz OTRS e-mail - https://phabricator.wikimedia.org/T129743#2395712 (10Urbanecm) I know. info@wikimedia.cz is an alias which forwards all incoming mails to wm-cz@wikimedia.org. But (I don't know why) the mails ends up in info-cs@wikimedia.org. S...
[06:20:05] 06Operations, 10Mail, 10OTRS: E-mail incorrectly forwarded to wm-cz OTRS e-mail - https://phabricator.wikimedia.org/T129743#2395713 (10Urbanecm) Our config is set up correctly as you can see in the example in the description above.
[06:27:22] 06Operations, 06Release-Engineering-Team, 07Developer-notice, 05Gitblit-Deprecate, and 2 others: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2395719 (10greg) Thanks @dzahn. Just trying to set a good example :)
[06:27:34] 06Operations, 06Release-Engineering-Team, 07Developer-notice, 05Gitblit-Deprecate, and 2 others: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2395720 (10Paladox) @dzahn we can create the patch's then merge them...
[06:30:21] PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: puppet fail
[06:31:10] PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:10] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:30] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:52] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:52] PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:21] PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:30] 06Operations, 10Mail, 10OTRS: E-mail incorrectly forwarded to wm-cz OTRS e-mail - https://phabricator.wikimedia.org/T129743#2395737 (10Matthewrbowker) OTRS looks good. wm-cz@wikimedia.org places email into the queue chapters::wm-cz according to https://ticket.wikimedia.org/otrs/index.pl?Action=AdminSystemAd...
[06:34:40] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:39:34] 06Operations, 10Mail, 10OTRS: E-mail incorrectly forwarded to wm-cz OTRS e-mail - https://phabricator.wikimedia.org/T129743#2395742 (10Dzahn) >>! In T129743#2395712, @Urbanecm wrote: > I know. info@wikimedia.cz is an alias which forwards all incoming mails to wm-cz@wikimedia.org. But (I don't know why) the m...
[06:41:21] RECOVERY - Apache HTTP on mw1252 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.027 second response time
[06:41:43] !log restarted hhvm on mw1252
[06:41:45] 06Operations, 06Release-Engineering-Team, 07Developer-notice, 05Gitblit-Deprecate, and 2 others: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2395743 (10Paladox) @greg I think we can do it on the date you like...
[06:41:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:42:11] RECOVERY - HHVM rendering on mw1252 is OK: HTTP OK: HTTP/1.1 200 OK - 66630 bytes in 0.124 second response time
[06:42:19] 06Operations, 10Mail, 10OTRS: E-mail incorrectly forwarded to wm-cz OTRS e-mail - https://phabricator.wikimedia.org/T129743#2395744 (10Urbanecm) The mail which is in the example in the description ended up in info-cs@wikimedia.org. I can try to send a test mail to info@wikimedia.cz but I can't check where it...
[06:42:54] (03CR) 1020after4: "this is actually obsolete now that we have arcanist and libphutil packaged" [puppet] - 10https://gerrit.wikimedia.org/r/293818 (owner: 1020after4)
[06:43:08] (03Abandoned) 1020after4: use wmf/stable branch of arcanist and libphutil [puppet] - 10https://gerrit.wikimedia.org/r/293818 (owner: 1020after4)
[06:44:13] 06Operations, 10Mail, 10OTRS: E-mail incorrectly forwarded to wm-cz OTRS e-mail - https://phabricator.wikimedia.org/T129743#2395745 (10Urbanecm) >! In T129743#2395742, @Dzahn wrote: > I checked the exim alias file that is under control of operations but there is no wm-cz@ in there. This seems to be all handl...
[06:46:25] 06Operations, 06Release-Engineering-Team, 07Developer-notice, 05Gitblit-Deprecate, and 2 others: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2395749 (10greg) >>! In T137224#2395743, @Paladox wrote: > @greg I t...
[06:49:48] 06Operations, 10Mail, 10OTRS: E-mail incorrectly forwarded to wm-cz OTRS e-mail - https://phabricator.wikimedia.org/T129743#2395750 (10Matthewrbowker) >>! In T129743#2395744, @Urbanecm wrote: >>>! In T129743#2395737, @Matthewrbowker wrote: >> OTRS looks good. wm-cz@wikimedia.org places email into the queue...
[06:51:06] 06Operations, 06Release-Engineering-Team, 07Developer-notice, 05Gitblit-Deprecate, and 2 others: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2395751 (10greg) Scheduled: https://wikitech.wikimedia.org/wiki/Depl...
[06:51:09] 06Operations, 10Mail, 10OTRS: E-mail incorrectly forwarded to wm-cz OTRS e-mail - https://phabricator.wikimedia.org/T129743#2395752 (10Urbanecm) @Matthewrbowker I'll find one.
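(The mailman password reset described at 05:07 maps to a one-liner; a sketch of the invocation Dzahn describes, with the list name taken from the task and options as in GNU Mailman 2.1:)

```sh
# Reset the list admin password and mail it to the list owners.
# Omitting -q/--quiet is what makes Mailman send the notification mail.
/var/lib/mailman/bin/change_pw -l hackathonorganizers
```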
[06:54:33] 06Operations, 10Mail, 10OTRS: E-mail incorrectly forwarded to wm-cz OTRS e-mail - https://phabricator.wikimedia.org/T129743#2395753 (10Matthewrbowker) 05Open>03Resolved a:03Matthewrbowker I found it. OTRS uses inbound email addresses to sort email. So I added another filter for info@wikimedia.cz (htt...
[06:56:40] RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:41] RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:32] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:57:51] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:01] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:02] RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:11] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[06:58:21] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:22] RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:04] 06Operations, 10Mail, 10OTRS: E-mail incorrectly forwarded to wm-cz OTRS e-mail - https://phabricator.wikimedia.org/T129743#2395756 (10Urbanecm) Thanks!
[07:14:20] PROBLEM - Disk space on elastic1024 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 79972 MB (15% inode=99%)
[07:15:00] (03CR) 10Muehlenhoff: "Looks good, but let's first land the config change for Jenkins." [puppet] - 10https://gerrit.wikimedia.org/r/295255 (https://phabricator.wikimedia.org/T80385) (owner: 10Hashar)
[07:15:28] 06Operations, 06Release-Engineering-Team, 07Developer-notice, 05Gitblit-Deprecate, and 2 others: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2395760 (10Paladox) @greg thanks and sorry.
[07:18:40] RECOVERY - Disk space on elastic1024 is OK: DISK OK
[07:20:31] PROBLEM - puppet last run on mw2084 is CRITICAL: CRITICAL: puppet fail
[07:20:45] 06Operations, 10DBA: m1-master switch from db1001 to db1016 - https://phabricator.wikimedia.org/T106312#2395761 (10akosiaris) >>! In T106312#2394393, @jcrespo wrote: > @Akosiaris @MoritzMuehlenhoff @fgiunchedi @demon @Krenair @Dzahn I intend to perform the failover on Wednesday 22, 16:00 UTC. > > I do not re...
[07:27:09] (03CR) 10Alexandros Kosiaris: [C: 032] nagios_common: delete check_http_bits command [puppet] - 10https://gerrit.wikimedia.org/r/295321 (https://phabricator.wikimedia.org/T107430) (owner: 10Dzahn)
[07:34:51] (03CR) 10Alexandros Kosiaris: "+1ed from me, let's change the jenkins config and run a" [puppet] - 10https://gerrit.wikimedia.org/r/295255 (https://phabricator.wikimedia.org/T80385) (owner: 10Hashar)
[07:39:09] !log restarted hhvm on mw1139 (hhvm-dump in /tmp/hhvm.20736.bt.)
[07:39:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:41:42] !log restarted hhvm on mw1141 - hhvm was getting SEGV (dump in /tmp/hhvm.8735.bt.)
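(The hhvm restarts above reference backtrace dumps saved before restarting. A generic sketch of grabbing such a dump with plain gdb; the exact wrapper used in production is not shown in the log, so the commands and paths here are assumptions:)

```sh
# Capture a full thread backtrace of a wedged hhvm before restarting it.
PID=$(pidof -s hhvm)
gdb -p "$PID" -batch -ex 'thread apply all bt' > "/tmp/hhvm.${PID}.bt" 2>&1
service hhvm restart   # or systemctl, depending on the host's init system
```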
[07:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:42:32] something is definitely weird
[07:42:51] icinga is now telling me that two other mw servers have CRITICALs
[07:44:18] mw1197 and mw1230, which have ubuntu, so definitely not new appservers
[07:46:10] seems to be all API servers
[07:46:42] and now auto-resolve
[07:46:44] *resolved
[07:48:31] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[07:48:31] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[07:48:42] RECOVERY - puppet last run on mw2084 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:52:13] (03CR) 10ArielGlenn: [C: 031] Move dataset ferm rules into the role [puppet] - 10https://gerrit.wikimedia.org/r/294930 (owner: 10Muehlenhoff)
[07:52:32] (03CR) 10Alexandros Kosiaris: [C: 04-1] Manage Postgresql data dir with Puppet (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel)
[07:53:26] 06Operations, 10Ops-Access-Requests, 06Parsing-Team, 06Services: Allow the Services team to administer the Parsoid cluster - https://phabricator.wikimedia.org/T137879#2382336 (10akosiaris) LGTM
[07:57:16] (03PS2) 10Muehlenhoff: drac,icinga,ipmi: do not ensure => latest [puppet] - 10https://gerrit.wikimedia.org/r/295322 (https://phabricator.wikimedia.org/T115348) (owner: 10Dzahn)
[07:57:52] (03CR) 10Alexandros Kosiaris: [C: 032] Phabricator: remove remote 'origin' from system-wide gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/294945 (https://phabricator.wikimedia.org/T137819) (owner: 1020after4)
[07:57:57] (03PS4) 10Alexandros Kosiaris: Phabricator: remove remote 'origin' from system-wide gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/294945 (https://phabricator.wikimedia.org/T137819) (owner: 1020after4)
[07:58:02] (03CR) 10Alexandros Kosiaris: [V: 032] Phabricator: remove remote 'origin' from system-wide gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/294945 (https://phabricator.wikimedia.org/T137819) (owner: 1020after4)
[07:58:31] (03CR) 10Gehel: Manage Postgresql data dir with Puppet (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel)
[08:00:51] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:02:12] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[08:03:41] PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 2 failures
[08:03:51] PROBLEM - Disk space on fluorine is CRITICAL: DISK CRITICAL - free space: /a 135873 MB (3% inode=99%)
[08:04:00] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[08:05:30] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[08:05:32] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[08:05:51] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
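(The "Unmerged changes on repository puppet" alerts above compare the checked-out HEAD with origin/production; the repo path and ref come straight from the alert text. A minimal sketch of an equivalent manual check with standard git commands:)

```sh
# Count commits present on origin/production but not yet merged locally.
cd /var/lib/git/operations/puppet
git fetch origin
git rev-list --count HEAD..origin/production   # 0 means nothing to merge
```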
[08:07:32] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:07:51] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:08:24] (03PS1) 10Alexandros Kosiaris: postgres: Specify extra module in RSpec [puppet] - 10https://gerrit.wikimedia.org/r/295327
[08:08:47] 06Operations, 10Gerrit, 06Release-Engineering-Team, 06WMF-Legal, and 2 others: Gerrit seemingly violates data retention guidelines - https://phabricator.wikimedia.org/T114395#2395835 (10ZhouZ)
[08:09:42] (03PS3) 10Muehlenhoff: drac,icinga,ipmi: do not ensure => latest [puppet] - 10https://gerrit.wikimedia.org/r/295322 (https://phabricator.wikimedia.org/T115348) (owner: 10Dzahn)
[08:10:44] (03CR) 10Muehlenhoff: [C: 032 V: 032] drac,icinga,ipmi: do not ensure => latest [puppet] - 10https://gerrit.wikimedia.org/r/295322 (https://phabricator.wikimedia.org/T115348) (owner: 10Dzahn)
[08:13:48] (03CR) 10Gehel: [C: 031] postgres: Specify extra module in RSpec [puppet] - 10https://gerrit.wikimedia.org/r/295327 (owner: 10Alexandros Kosiaris)
[08:16:51] RECOVERY - Disk space on fluorine is OK: DISK OK
[08:17:31] PROBLEM - puppet last run on mw2075 is CRITICAL: CRITICAL: puppet fail
[08:17:45] (03PS2) 10Muehlenhoff: Move dataset ferm rules into the role [puppet] - 10https://gerrit.wikimedia.org/r/294930
[08:19:41] (03CR) 10Gehel: [C: 032] postgres: Specify extra module in RSpec [puppet] - 10https://gerrit.wikimedia.org/r/295327 (owner: 10Alexandros Kosiaris)
[08:21:31] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[08:23:18] (03PS3) 10Muehlenhoff: Move dataset ferm rules into the role [puppet] - 10https://gerrit.wikimedia.org/r/294930
[08:23:42] (03CR) 10Alexandros Kosiaris: "I find that ldaplist is really to blame here (i.e. why on earth grepping for a term on the output makes sense, while submitting that term " [puppet] - 10https://gerrit.wikimedia.org/r/295198 (https://phabricator.wikimedia.org/T122595) (owner: 10Muehlenhoff)
[08:23:50] (03CR) 10Muehlenhoff: [C: 032 V: 032] Move dataset ferm rules into the role [puppet] - 10https://gerrit.wikimedia.org/r/294930 (owner: 10Muehlenhoff)
[08:24:49] (03PS7) 10Gehel: Manage Postgresql data dir with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092)
[08:26:29] (03CR) 10Gehel: Manage Postgresql data dir with Puppet (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel)
[08:26:31] (03CR) 10Alexandros Kosiaris: [C: 032] fix puppet unit test for squid3 [puppet] - 10https://gerrit.wikimedia.org/r/295130 (owner: 10Maturain)
[08:26:37] (03PS2) 10Alexandros Kosiaris: fix puppet unit test for squid3 [puppet] - 10https://gerrit.wikimedia.org/r/295130 (owner: 10Maturain)
[08:26:43] (03CR) 10Alexandros Kosiaris: [V: 032] fix puppet unit test for squid3 [puppet] - 10https://gerrit.wikimedia.org/r/295130 (owner: 10Maturain)
[08:27:20] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[08:28:01] RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:32:29] (03CR) 10Muehlenhoff: "Fine with me, personally I don't even see the need for ldaplist(1) to begin with. It's a reimplementation of some obscure Solaris tool and" [puppet] - 10https://gerrit.wikimedia.org/r/295198 (https://phabricator.wikimedia.org/T122595) (owner: 10Muehlenhoff)
[08:32:36] (03Abandoned) 10Muehlenhoff: Bump the size limit for labs openldap server to 4096 [puppet] - 10https://gerrit.wikimedia.org/r/295198 (https://phabricator.wikimedia.org/T122595) (owner: 10Muehlenhoff)
[08:38:55] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[08:39:38] (03CR) 10Alexandros Kosiaris: "change looks good in premise, but pcc complains on labsdb1006" [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel)
[08:42:54] (03CR) 10Alexandros Kosiaris: "I can't say I love ldaplist(1) either. I only regard it as a somewhat friendlier interface to LDAP than ldapsearch. Regardless of the tool" [puppet] - 10https://gerrit.wikimedia.org/r/295198 (https://phabricator.wikimedia.org/T122595) (owner: 10Muehlenhoff)
[08:45:47] (03CR) 10Muehlenhoff: "cc @abogott who made the original request/use case in T122595" [puppet] - 10https://gerrit.wikimedia.org/r/295198 (https://phabricator.wikimedia.org/T122595) (owner: 10Muehlenhoff)
[08:46:17] (03CR) 10Alexandros Kosiaris: [C: 032] otrs: add check_procs for clamd/freshclam [puppet] - 10https://gerrit.wikimedia.org/r/294939 (https://phabricator.wikimedia.org/T137188) (owner: 10Faidon Liambotis)
[08:46:23] (03PS2) 10Alexandros Kosiaris: otrs: add check_procs for clamd/freshclam [puppet] - 10https://gerrit.wikimedia.org/r/294939 (https://phabricator.wikimedia.org/T137188) (owner: 10Faidon Liambotis)
[08:46:35] RECOVERY - puppet last run on mw2075 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:46:43] (03CR) 10Alexandros Kosiaris: [V: 032] otrs: add check_procs for clamd/freshclam [puppet] - 10https://gerrit.wikimedia.org/r/294939 (https://phabricator.wikimedia.org/T137188) (owner: 10Faidon Liambotis)
[08:46:47] 06Operations, 10DBA: m1-master switch from db1001 to db1016 - https://phabricator.wikimedia.org/T106312#2395891 (10fgiunchedi) >>! In T106312#2395761, @akosiaris wrote: >>>! In T106312#2394393, @jcrespo wrote: >> @Akosiaris @MoritzMuehlenhoff @fgiunchedi @demon @Krenair @Dzahn I intend to perform the failover...
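(Since the review thread above weighs ldaplist against plain ldapsearch, a minimal sketch of the direct ldapsearch equivalent; the server URI and base DN are illustrative guesses, not taken from the log:)

```sh
# Look up a user entry directly, instead of grepping ldaplist output.
# -x: simple bind, -H: server URI, -b: search base -- all standard flags.
ldapsearch -x \
  -H ldap://ldap.example.wikimedia.org \
  -b 'ou=people,dc=wikimedia,dc=org' \
  '(uid=someuser)' cn uidNumber
```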
[08:47:32] 06Operations, 10DBA: m1-master switch from db1001 to db1016 - https://phabricator.wikimedia.org/T106312#2395893 (10MoritzMuehlenhoff) 17 UTC also fine with me
[08:48:22] (03PS2) 10Muehlenhoff: package_builder: Add gobject-introspection to package list [puppet] - 10https://gerrit.wikimedia.org/r/295226
[08:48:52] (03CR) 10Muehlenhoff: [C: 032 V: 032] package_builder: Add gobject-introspection to package list [puppet] - 10https://gerrit.wikimedia.org/r/295226 (owner: 10Muehlenhoff)
[08:49:25] PROBLEM - puppet last run on mw2123 is CRITICAL: CRITICAL: puppet fail
[08:58:14] (03PS8) 10Gehel: Manage Postgresql data dir with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092)
[09:00:01] 06Operations, 10DBA: m1-master switch from db1001 to db1016 - https://phabricator.wikimedia.org/T106312#2395900 (10jcrespo) Let's go with [[ https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160622T1500 | 17 UTC ]] instead
[09:02:54] (03CR) 10Gehel: "Puppet compiler now looks good https://puppet-compiler.wmflabs.org/3152/" [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel)
[09:12:17] (03CR) 10DCausse: "I agree with Erik, the regexes are too complex imo and really hard to determine if a node is considered or not." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294918 (owner: 10Gehel)
[09:12:19] (03CR) 10Alexandros Kosiaris: [C: 031] Manage Postgresql data dir with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel)
[09:15:03] (03CR) 10Gehel: "Puppet compiler now fails on restbase1001.eqiad.wmnet. But the production catalogue also fails, so the issue is probably outside of this c" [puppet] - 10https://gerrit.wikimedia.org/r/295123 (https://phabricator.wikimedia.org/T137422) (owner: 10Nicko)
[09:18:34] RECOVERY - puppet last run on mw2123 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:21:53] (03PS1) 10Jcrespo: Depool db1068; repool db1070 as api [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295328
[09:22:39] !log rolling reboot of logstash cluster to Linux 4.4
[09:22:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:23:16] (03CR) 10Jcrespo: [C: 032] Depool db1068; repool db1070 as api [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295328 (owner: 10Jcrespo)
[09:25:06] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1068; repool db1070 and db1071 as api (duration: 00m 27s)
[09:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:27:14] PROBLEM - BGP status on cr2-ulsfo is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active, AS2914/IPv6: Active
[09:27:44] PROBLEM - NTP on labsdb1008 is CRITICAL: NTP CRITICAL: Offset unknown
[09:28:59] mmm
[09:29:51] I was worried that meant network/load/down problems, but it does not
[09:31:42] (03PS3) 10Gehel: Configuration for new elasticsearch servers in eqiad. [puppet] - 10https://gerrit.wikimedia.org/r/294918
[09:34:14] (03PS9) 10Gehel: Manage Postgresql data dir with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092)
[09:35:40] (03CR) 10Gehel: [C: 032] Manage Postgresql data dir with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel)
[09:39:38] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[09:40:49] (03CR) 10DCausse: [C: 031] Configuration for new elasticsearch servers in eqiad. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294918 (owner: 10Gehel)
[09:43:24] !log lowering disk high watermark to rebalance elasticsearch eqiad cluster disk space
[09:43:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:46:08] RECOVERY - NTP on labsdb1008 is OK: NTP OK: Offset -0.00455057621 secs
[09:51:29] 06Operations, 10DBA: Decomission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2395956 (10jcrespo) a:03jcrespo
[09:52:35] 06Operations, 10DBA: Upgrade db1022, which has an older kernel - https://phabricator.wikimedia.org/T101516#2395963 (10jcrespo)
[09:53:16] (03CR) 10Nicko: Include a cassandra::instance::monitoring class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/295123 (https://phabricator.wikimedia.org/T137422) (owner: 10Nicko)
[09:54:02] 06Operations, 10Gerrit, 06Release-Engineering-Team, 06WMF-Legal, and 2 others: Gerrit seemingly violates data retention guidelines - https://phabricator.wikimedia.org/T114395#1694145 (10hashar) Gerrit runs on ytterbium and Apache2 has a logrotate rule: ``` $ cat /etc/logrotate.d/apache2 /var/log/apache2/*...
[09:54:06] 06Operations, 10Gerrit, 06Release-Engineering-Team, 06WMF-Legal, and 2 others: Gerrit seemingly violates data retention guidelines - https://phabricator.wikimedia.org/T114395#2395969 (10hashar) p:05High>03Normal
[10:03:09] !log installing wget security updates on Ubuntu systems
[10:03:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:11:06] PROBLEM - mediawiki-installation DSH group on mw1274 is CRITICAL: Host mw1274 is not in mediawiki-installation dsh group
[10:12:07] PROBLEM - mediawiki-installation DSH group on mw1283 is CRITICAL: Host mw1283 is not in mediawiki-installation dsh group
[10:12:56] RECOVERY - puppet last run on mw1282 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[10:13:28] the above ones are mine, new appservers, but they were silenced
[10:13:51] mmm apparently expired
[10:13:55] re-set downtime
[10:19:02] 06Operations, 10vm-requests: eqiad/codfw: 4 VM request for prometheus - https://phabricator.wikimedia.org/T136313#2395997 (10fgiunchedi) p:05Triage>03Normal a:05Dzahn>03fgiunchedi
[10:22:21] !log installing expat security updates on Ubuntu systems
[10:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:24:18] RECOVERY - puppet last run on ms-be2012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:25:15] 06Operations, 10ops-codfw: ms-be2012.codfw.wmnet: slot=10 dev=sdk failed - https://phabricator.wikimedia.org/T135975#2396006 (10fgiunchedi) 05Open>03Resolved disk rebuilding
[10:25:17] (03CR) 10Hashar: "I have tried the migration on my local machine and commented on T80385. In short:" [puppet] - 10https://gerrit.wikimedia.org/r/295255 (https://phabricator.wikimedia.org/T80385) (owner: 10Hashar)
[10:27:20] (03PS1) 10Ema: tlsproxy: enable client/server TFO support in the kernel [puppet] - 10https://gerrit.wikimedia.org/r/295331 (https://phabricator.wikimedia.org/T108827)
[10:32:05] (03PS2) 10Ema: tlsproxy: enable client/server TFO support in the kernel [puppet] - 10https://gerrit.wikimedia.org/r/295331 (https://phabricator.wikimedia.org/T108827)
[10:32:09] !log reboot ms-be2003 for disk ordering - T137785
[10:32:10] T137785: ms-be2003.codfw.wmnet: slot=4 dev=sde failed - https://phabricator.wikimedia.org/T137785
[10:32:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:40:28] (03CR) 10Hashar: "I guess we will want:" [puppet] - 10https://gerrit.wikimedia.org/r/295255 (https://phabricator.wikimedia.org/T80385) (owner: 10Hashar)
[10:42:16] RECOVERY - puppet last run on ms-be2003 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[10:43:15] 06Operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=4 dev=sde failed - https://phabricator.wikimedia.org/T137785#2396028 (10fgiunchedi) 05Open>03Resolved disk rebuilding
[10:44:06] (03CR) 10Alexandros Kosiaris: network: add $production_networks (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis)
[10:46:56] (03PS31) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis)
[10:56:49] (03CR) 10Muehlenhoff: tlsproxy: enable client/server TFO support in the kernel (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/295331 (https://phabricator.wikimedia.org/T108827) (owner: 10Ema)
[11:01:56] RECOVERY - BGP status on cr2-ulsfo is OK: BGP OK - up: 77, down: 2, shutdown: 0
[11:05:54] 06Operations, 10Traffic, 13Patch-For-Review: Investigate TCP Fast Open for tlsproxy - https://phabricator.wikimedia.org/T108827#2396050 (10ema) So, here are a few findings so far. tshark can be used to detect SYN packets with a TFO cookie request: tshark -f 'tcp[tcpflags] & tcp-syn != 0' -Y 'tcp.options....
[11:06:01] !log reimaging db1068
[11:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:10:27] (03PS1) 10Alexandros Kosiaris: ferm: Kill INTERNAL_V4/INTERNAL_V6 definitions [puppet] - 10https://gerrit.wikimedia.org/r/295332
[11:10:29] (03PS1) 10Alexandros Kosiaris: ferm: Populate INTERNAL from network::constants [puppet] - 10https://gerrit.wikimedia.org/r/295333
[11:15:20] (03CR) 10Alexandros Kosiaris: network: add $production_networks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis)
[11:15:29] paravoid: ^
[11:18:06] this will probably break Labs instances
[11:18:54] hm, or not? would sphere => private include labs instances networks?
[11:18:56] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://citoid.svc.codfw.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.codfw.wmnet:1970/api
[11:20:56] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[11:22:34] what was that?
[11:22:43] anomie, are you doing morning swat today?
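(For the TFO change under review above, the kernel knob involved is net.ipv4.tcp_fastopen; a minimal sketch of enabling it by hand for testing, with the bitmask meaning per the standard kernel docs — the value actually chosen by the patch is not shown in the log:)

```sh
# net.ipv4.tcp_fastopen is a bitmask: 1 enables client support,
# 2 enables server support, 3 enables both.
sysctl net.ipv4.tcp_fastopen        # show the current value
sysctl -w net.ipv4.tcp_fastopen=3   # enable client + server for a test
```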
[11:25:27] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:26:19] akosiaris, mobrovac?
[11:27:35] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[11:28:06] in the past, temporary issues were caused by dependency on url_downloader
[11:28:15] on codfw
[11:28:22] let me rule that out
[11:29:16] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:29:36] * akosiaris looking
[11:29:47] never got the page btw
[11:30:15] PROBLEM - check_mysql on fdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2198
[11:31:50] me neither
[11:32:09] Unable to locate resource with pmcid PMC9999999
[11:32:21] guess which governmental database had problems again
[11:32:54] jynus: did you get the page ?
[11:32:58] nope
[11:33:02] hmm
[11:33:28] alsafi is up, BTW
[11:33:51] jynus: yeah, had nothing to do with url_downloader this time around
[11:34:10] it's the gov database that citoid uses to make sure PMCIDs are ok
[11:34:15] valid or something
[11:35:12] RECOVERY - check_mysql on fdb2001 is OK: Uptime: 652776 Threads: 1 Questions: 6620000 Slow queries: 4145 Opens: 713 Flush tables: 2 Open tables: 574 Queries per second avg: 10.141 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[11:35:23] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[11:35:58] I've varied on $::realm on purpose to avoid exactly that
[11:36:27] the one thing that might break indeed is the labs support hosts
[11:36:32] akosiaris, are you saying that an external resource breaks the app, or just that the check is sensitive to that?
[11:37:08] (03PS1) 10KartikMistry: Deploy Compact Language Links as default (Stage 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295334 (https://phabricator.wikimedia.org/T136677)
[11:37:29] jynus: the service relies on an external resource to validate that a given PMCID is valid. It submits a request to http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
[11:37:41] the check, being swagger based, checks all endpoints
[11:37:54] and deliberately submits an invalid PMCID of 99999999
[11:38:03] plus/minus a few 9s
[11:38:30] but when that external resource is unresponsive, the check fails and we get that
[11:38:56] so it is the check that fails, not the whole application
[11:39:08] 1 heuristic check
[11:39:23] that is ok
[11:39:55] it's like the "San Francisco" header, being heuristic is ok
[11:40:06] well, it should not be alerting
[11:40:13] yes
[11:40:27] or if it is, it should be clear why
[11:40:38] can we not page when a random third-party service fails please? :)
[11:40:50] and ofc there is the big question of why we are basing our monitoring on an external database
[11:41:15] alerts should be actionable
[11:41:25] the answer usually is "we can not know if a PMCID is invalid unless someone else tells us so" but I've never understood why that has to be monitored
[11:41:33] I would be more worried about pinging an external resource every 5 minutes
[11:41:47] jynus: that is what we are effectively doing
[11:41:59] akosiaris, that seems like an application-level check
[11:42:06] ok to do on "user space"
[11:42:16] not on "infrastructure space"
[11:42:21] if that means something
[11:42:27] yeah, I don't follow
[11:42:50] remember that our checks are integrated with the service by monitoring everything the swagger spec advertises
[11:43:04] yes
[11:43:04] which in principle is what we want
[11:43:14] making sure all advertised endpoints actually work
[11:43:31] but honestly, monitoring that endpoint obviously makes no sense
[11:43:35] but maybe not everything should be advertised
[11:43:49] only the minimum things to say "this is up"
[11:43:51] well, it's a bit more complex than that
[11:43:56] I know
[11:44:01] no, the premise is the other way around
[11:44:06] I am not complaining to you
[11:44:08] do everything
[11:44:19] otherwise you have errors you don't ever catch
[11:44:23] and downtimes you never catch
[11:44:48] like ORES being down 8 hours the other day and no one noticing because it responded ok to our monitoring
[11:44:57] but it would return 503 for a number of endpoints
[11:45:09] ORES is in the process of fixing the swagger spec btw
[11:45:19] so soon it should not be possible for that to happen again
[11:45:22] but I digress
[11:46:05] so, I think we should have a way of informing our service_checker that a specific endpoint should not be monitored
[11:46:42] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:46:53] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:47:13] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:47:43] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:48:04] so, if a service is up (processes are running, hardware is healthy, protocol responds), and a programming error prevents a very specific piece of code from running, why should I be paged?
[11:48:17] same thing btw ^
[11:48:26] curl 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&retmode=xml&id=9999999'
[11:48:30] never returns...
[11:48:41] the other day I got called up for trying to add precisely those kinds of checks to icinga
[11:48:52] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://citoid.svc.codfw.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.codfw.wmnet:1970/api
[11:48:52] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api
[11:48:59] *out
[11:49:22] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[11:50:06] jynus: you are opening a long conversation here, but in principle yes I agree with you, I don't think we should be paged for something like that.
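(The checks under discussion walk every endpoint the service's swagger spec advertises. A rough shell sketch of that idea against a service exposing its spec at /?spec, as the citoid and kartotherian alerts in this log do; real specs template paths like /{domain}/v1/..., which need example values substituted before probing, so this only exercises literal paths:)

```sh
# Probe every literal GET path advertised by a service's swagger spec.
BASE=http://citoid.svc.eqiad.wmnet:1970

curl -s "${BASE}/?spec" | jq -r '.paths | keys[]' | while read -r path; do
  case "$path" in
    *'{'*) echo "SKIP $path (templated, needs example values)" ;;
    *) code=$(curl -s -m 10 -o /dev/null -w '%{http_code}' "${BASE}${path}")
       echo "$code $path" ;;
  esac
done
```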
[11:50:31] the actual implementation details ofc don't exist
[11:50:34] but you are right
[11:50:44] actually, my position is that it should page, and more checks are good, but to the right person
[11:51:06] or with the right protocols
[11:51:56] I was commenting on the "why should I be paged?"
[11:52:05] specifically the "I" part
[11:52:10] where I == ops for me
[11:52:20] somebody should be paged ofc
[11:52:45] speaking of which, I've got no SMS yet
[11:53:05] stalled ? never sent ?
[11:53:12] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[11:53:12] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[11:53:22] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[11:53:26] yesterday lots of sms were sent
[11:53:32] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[11:53:36] yup. I've received those
[11:53:38] do you want me to trigger a page?
[11:53:44] lol
[11:53:52] nope, I can do that on my own :-)
[11:53:53] I'm serious
[11:54:10] very serious
[11:54:14] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[11:54:16] I actually rely on pfw on codfw to do that anytime now
[11:56:31] ah,
[11:56:39] so we were not meant to be paged
[11:56:54] which is ok
[11:56:56] the failing LVS check is the one _joe_ introduced the other day that relies on service_checker
[11:57:12] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[11:57:13] which is good
[11:57:38] _joe_: :)
[11:58:13] ^ checking elasticsearch...
[12:11:02] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[12:16:36] akosiaris: so what is "sphere => private" supposed to cover?
[12:16:58] is it what it is now? half of production + all of Labs?
[12:25:29] !log lowering throttling limit for index recovery on eqiad elasticsearch cluster
[12:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:30:39] !log T136973 started cut of branch wmf/1.28.0-wmf.7
[12:30:40] T136973: MW-1.28.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T136973
[12:30:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:33:21] !log Removed Wikidata json dumps from 20160620 (inconsistent, per T138291).
[12:33:22] T138291: Latest wikidata JSON dump contains unexpected sql warning - https://phabricator.wikimedia.org/T138291
[12:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:35:52] !log lowering throttling limit for index recovery on codfw elasticsearch cluster
[12:35:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:36:29] !log Started a new JSON dump creation on snapshot1003 (after the last one was inconsistent, per T138291)
[12:36:30] T138291: Latest wikidata JSON dump contains unexpected sql warning - https://phabricator.wikimedia.org/T138291
[12:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:49:17] !log Running extensions/Echo/maintenance/backfillReadBundles.php on all Echo-enabled wikis
[12:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Mr. Obvious
[12:49:35] !log Running extensions/Echo/maintenance/backfillReadBundles.php on all Echo-enabled wikis for T136368
[12:49:36] T136368: Dynamic bundle: non-bundle_base notifications need a read timestamp - https://phabricator.wikimedia.org/T136368
[12:49:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Mr. Obvious
[12:55:03] PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: puppet fail
[12:57:23] RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[12:57:49] !log rolling restart of hhvm/apache in codfw for expat security update
[12:57:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:11:28] !log Running extensions/Echo/maintenance/removeOrphanedEvents.php on all Echo-enabled wikis for T136425
[13:11:29] T136425: Remove orphaned echo_event rows - https://phabricator.wikimedia.org/T136425
[13:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Mr. Obvious
[13:12:42] PROBLEM - mediawiki-installation DSH group on mw1286 is CRITICAL: Host mw1286 is not in mediawiki-installation dsh group
[13:12:42] PROBLEM - mediawiki-installation DSH group on mw1285 is CRITICAL: Host mw1285 is not in mediawiki-installation dsh group
[13:13:12] --^ these are mine
[13:15:09] !log T136973 applied all security patches to 1.28.0-wmf.7
[13:15:10] T136973: MW-1.28.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T136973
[13:15:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:16:58] (03PS1) 10Hashar: Group0 to 1.28.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295339 (https://phabricator.wikimedia.org/T136973)
[13:18:12] I am willing to push 1.28.0-wmf.7 to testwiki soonish but apparently there are a few things going on
[13:18:15] so I will hold a bit :)
[13:27:40] (03PS1) 10Jcrespo: Repool db1068 with low weight; depool db1061 and db1062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295341
[13:28:45] (03CR) 10Jcrespo: [C: 032] Repool db1068 with low weight; depool db1061 and db1062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295341 (owner: 10Jcrespo)
[13:35:21] PROBLEM - Apache HTTP on mw1284 is CRITICAL: Connection refused
[13:36:32] PROBLEM - puppet last run on mw1284 is CRITICAL: Connection refused by host
[13:37:01] PROBLEM - salt-minion processes on mw1284 is CRITICAL: Connection refused by host
[13:37:46] (03CR) 10Joal: "@Elukey: Given the talk we had and the fact that Brandon asked you to log everything, maybe this filter should be removed?" [puppet] - 10https://gerrit.wikimedia.org/r/294455 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey)
[13:37:51] PROBLEM - Check size of conntrack table on mw1284 is CRITICAL: Connection refused by host
[13:38:10] PROBLEM - DPKG on mw1284 is CRITICAL: Connection refused by host
[13:38:30] PROBLEM - Disk space on mw1284 is CRITICAL: Connection refused by host
[13:38:50] PROBLEM - MD RAID on mw1284 is CRITICAL: Connection refused by host
[13:39:41] PROBLEM - configured eth on mw1284 is CRITICAL: Connection refused by host
[13:40:01] PROBLEM - dhclient process on mw1284 is CRITICAL: Connection refused by host
[13:40:21] PROBLEM - mediawiki-installation DSH group on mw1284 is CRITICAL: Host mw1284 is not in mediawiki-installation dsh group
[13:40:41] PROBLEM - nutcracker port on mw1284 is CRITICAL: Connection refused by host
[13:41:00] PROBLEM - nutcracker process on mw1284 is CRITICAL: Connection refused by host
[13:41:09] this is me! new appserver
[13:41:16] didn't see it on icinga till now
[13:41:34] silencing
[13:41:38] ah I was about to ask if that was mori tz
[13:41:41] thanks
[13:42:02] nope, I haven't done anything wrt mw1* servers yet
[13:42:42] 06Operations, 06Discovery, 06Maps, 03Maps-Sprint, 13Patch-For-Review: Configure new maps servers in eqiad - https://phabricator.wikimedia.org/T138092#2388933 (10Gehel)
[13:43:04] 06Operations, 06Discovery, 06Maps, 03Maps-Sprint, 13Patch-For-Review: Configure new maps servers in eqiad - https://phabricator.wikimedia.org/T138092#2388933 (10Gehel) a:03Gehel
[13:43:31] PROBLEM - puppet last run on mw2192 is CRITICAL: CRITICAL: puppet fail
[13:44:24] 06Operations, 06Discovery, 06Maps, 07Epic: Epic: cultivating the Maps garden - https://phabricator.wikimedia.org/T137616#2396438 (10Gehel)
[13:44:26] 06Operations, 06Discovery, 06Maps, 03Maps-Sprint, 13Patch-For-Review: Rack/Setup 4 map servers in eqiad - https://phabricator.wikimedia.org/T135018#2396437 (10Gehel) 05Open>03Resolved
[13:46:23] (03PS1) 10Gehel: Postgresql: init database with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295343 (https://phabricator.wikimedia.org/T138092)
[13:47:44] going to live hack mediawiki-config to push 1.28.0-wmf.7 to testwiki
[13:48:22] (03PS2) 10Filippo Giunchedi: svc: add graphite LVS addresses [dns] - 10https://gerrit.wikimedia.org/r/289635 (https://phabricator.wikimedia.org/T85451)
[13:48:24] (03PS1) 10Filippo Giunchedi: add prometheus VMs in eqiad/codfw [dns] - 10https://gerrit.wikimedia.org/r/295344 (https://phabricator.wikimedia.org/T136313)
[13:48:51] !log hashar@tin Started scap: testwiki to 1.28.0-wmf.7 T136973
[13:48:52] T136973: MW-1.28.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T136973
[13:48:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:49:04] 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#2396500 (10chasemp) p:05Triage>03Normal one thought is we have an influx of new labsdb things coming I believe. This way sort itself out w/o a lot of in-place shuffling.
[13:49:21] PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master).
[13:49:32] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master).
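(The "!log lowering disk high watermark" and "lowering throttling limit for index recovery" entries above correspond to transient elasticsearch cluster settings; a sketch with placeholder values, since the values actually applied are not in the log:)

```sh
# Transient settings survive until the next full cluster restart.
# 'watermark.high' forces shards off nodes above the given disk usage;
# 'max_bytes_per_sec' throttles shard recovery traffic. The values here
# are illustrative, not the ones applied in production.
curl -s -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.disk.watermark.high": "75%",
    "indices.recovery.max_bytes_per_sec": "40mb"
  }
}'
```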
[13:49:40] ^^ both are me I believe
[13:51:30] (03PS2) 10Filippo Giunchedi: add prometheus VMs in eqiad/codfw [dns] - 10https://gerrit.wikimedia.org/r/295344 (https://phabricator.wikimedia.org/T136313)
[13:52:18] it could be me
[13:52:26] (03CR) 10Gehel: "Puppet compiler is not telling much, but at least it compiles cleanly: https://puppet-compiler.wmflabs.org/3154/" [puppet] - 10https://gerrit.wikimedia.org/r/295343 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel)
[13:52:29] (03CR) 10Hashar: "Will do the renaming tomorrow (Wednesday) during European morning." [puppet] - 10https://gerrit.wikimedia.org/r/295255 (https://phabricator.wikimedia.org/T80385) (owner: 10Hashar)
[13:52:52] jynus: oh there is a db pooling change that is left undeployed
[13:52:59] yes, I am on that
[13:53:02] jynus: is it safe to have it synced? I am running scap right now
[13:53:05] takes some time
[13:53:09] !log hashar@tin scap aborted: testwiki to 1.28.0-wmf.7 T136973 (duration: 04m 17s)
[13:53:10] T136973: MW-1.28.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T136973
[13:53:12] yes
[13:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:53:25] almost 99% of my changes are idempotent
[13:53:45] !log hashar@tin Started scap: testwiki to 1.28.0-wmf.7 (take two) T136973
[13:53:45] T136973: MW-1.28.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T136973
[13:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:54:00] well, not idempotent, but I mean they can be partially deployed/fail etc without problems
[13:54:01] I have cancelled too fast :D
[13:54:27] should I deploy then?
[13:54:39] I guess the scap run I am handling will do it
[13:54:56] it is busy rebuilding the l10n cache
[13:55:05] staging is ok right now
[13:55:20] !log hashar@tin scap aborted: testwiki to 1.28.0-wmf.7 (take two) T136973 (duration: 01m 35s)
[13:55:21] T136973: MW-1.28.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T136973
[13:55:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:55:24] ah wrong window
[13:55:25] ...
[13:55:35] !log hashar@tin Started scap: testwiki to 1.28.0-wmf.7 (take three) T136973
[13:55:35] T136973: MW-1.28.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T136973
[13:55:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:56:02] ACKNOWLEDGEMENT - cassandra CQL 10.64.0.79:9042 on maps1001 is CRITICAL: Connection refused Gehel configuration in progress
[13:56:03] ACKNOWLEDGEMENT - cassandra service on maps1001 is CRITICAL: NRPE: Command check_cassandra-state not defined Gehel configuration in progress
[13:56:03] ACKNOWLEDGEMENT - kartotherian endpoints health on maps1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.79, port=6533): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Gehel configuration in progress
[13:56:04] ACKNOWLEDGEMENT - puppet last run on maps1001 is CRITICAL: CRITICAL: Puppet has 7 failures Gehel configuration in progress
[13:56:04] ACKNOWLEDGEMENT - tilerator on maps1001 is CRITICAL: Connection refused Gehel configuration in progress
[13:56:05] ACKNOWLEDGEMENT - tileratorui on maps1001 is CRITICAL: Connection refused Gehel configuration in progress
[13:57:32] Sorry for the spam, I did not think that acknowledging alerts on a host with scheduled downtime would generate noise...
[13:58:27] it does, if marked "Send notification". disabling notifications != creating a downtime period
[13:59:03] e.g. if something alerts, then you downtime it, then it comes back up, it will notify
[13:59:47] (03CR) 10Muehlenhoff: Restart exim daily on Monday to Friday (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294929 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[13:59:49] jynus: thanks! I'll check that right now...
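(Scheduled downtime, as opposed to acknowledgements, is submitted through Icinga's external command file; a minimal sketch using the standard SCHEDULE_HOST_SVC_DOWNTIME command — the command-file path below is the Debian default and may differ on these hosts:)

```sh
# Schedule 2 hours of fixed downtime for all services on maps1001.
# Format: [now] SCHEDULE_HOST_SVC_DOWNTIME;host;start;end;fixed;trigger;duration;author;comment
now=$(date +%s)
printf '[%s] SCHEDULE_HOST_SVC_DOWNTIME;maps1001;%s;%s;1;0;7200;gehel;initial install\n' \
  "$now" "$now" "$((now + 7200))" > /var/lib/icinga/rw/icinga.cmd
```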
[14:00:07] (03PS2) 10Muehlenhoff: Restart exim daily on Monday to Friday [puppet] - 10https://gerrit.wikimedia.org/r/294929 (https://phabricator.wikimedia.org/T135991)
[14:00:46] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] add prometheus VMs in eqiad/codfw [dns] - 10https://gerrit.wikimedia.org/r/295344 (https://phabricator.wikimedia.org/T136313) (owner: 10Filippo Giunchedi)
[14:02:12] !log hashar@tin scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="labtestwiki" --outdir="/tmp/scap_l10n_2087727834" --threads=4 --lang en --quiet' returned non-zero exit status 255 (duration: 06m 37s)
[14:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:03:06] !log disabling alerting for maps100?\.eqiad\.wmnet during initial installation
[14:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:05:48] PROBLEM - Disk space on mw2243 is CRITICAL: Connection refused by host
[14:06:18] PROBLEM - MD RAID on mw2243 is CRITICAL: Timeout while attempting connection
[14:06:23] !log hashar@tin Started scap: (no message)
[14:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:07:08] PROBLEM - Apache HTTP on mw2243 is CRITICAL: Connection timed out
[14:07:17] PROBLEM - configured eth on mw2243 is CRITICAL: Timeout while attempting connection
[14:07:38] PROBLEM - dhclient process on mw2243 is CRITICAL: Timeout while attempting connection
[14:07:48] PROBLEM - mediawiki-installation DSH group on mw2243 is CRITICAL: Host mw2243 is not in mediawiki-installation dsh group
[14:08:08] PROBLEM - nutcracker port on mw2243 is CRITICAL: Timeout while attempting connection
[14:08:29] PROBLEM - nutcracker process on mw2243 is CRITICAL: Timeout while attempting connection
[14:08:47] PROBLEM - puppet last run on mw2243 is CRITICAL: Timeout while attempting connection
[14:08:49] have to quickly rush to school. be back in 6 minutes
[14:08:58] wikiversions.json is live hacked to push .7 to testwiki
[14:09:01] and scap going on
[14:09:08] PROBLEM - salt-minion processes on mw2243 is CRITICAL: Timeout while attempting connection
[14:09:22] !log hashar@tin scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="labtestwiki" --outdir="/tmp/scap_l10n_87423667" --threads=4 --lang en --quiet' returned non-zero exit status 255 (duration: 02m 58s)
[14:09:27] RECOVERY - puppet last run on mw2192 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:09:33] 2243 is mine
[14:09:37] new appserver
[14:09:39] PROBLEM - Check size of conntrack table on mw2243 is CRITICAL: Timeout while attempting connection
[14:09:46] wow I started the install this morning :/
[14:09:57] PROBLEM - DPKG on mw2243 is CRITICAL: Timeout while attempting connection
[14:12:57] !log depooling restbase1007 for upgrade to Linux 4.4
[14:13:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:14:22] !log correction: restbase1007 was already depooled for cassandra maintenance, thus only rebooting to 4.4
[14:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:14:53] back
[14:17:39] (03PS4) 10Gehel: Configuration for new elasticsearch servers in eqiad. [puppet] - 10https://gerrit.wikimedia.org/r/294918
[14:21:12] (03CR) 10EBernhardson: [C: 031] "other than david's comment looks sane to me" [puppet] - 10https://gerrit.wikimedia.org/r/294918 (owner: 10Gehel)
[14:22:01] (03CR) 10DCausse: [C: 031] Configuration for new elasticsearch servers in eqiad. [puppet] - 10https://gerrit.wikimedia.org/r/294918 (owner: 10Gehel)
[14:22:12] (03CR) 10Gehel: Configuration for new elasticsearch servers in eqiad. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294918 (owner: 10Gehel)
[14:25:17] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp main page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp main page via mobile-sections-lead returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/mobile-sections-remaining/{title} (retrieve remaining sections of en.wp main page via mobile-s
[14:25:57] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp main page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp main page via mobile-sections-lead returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/mobile-sections-remaining/{title} (retrieve remaining sections of en.wp main page via mobile-sections-rema
[14:26:18] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp main page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp main page via mobile-sections-lead returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Barack Obama page via mobile-sections-lead) i
[14:28:05] I'm taking a look at mobileapps
[14:28:28] !log hashar@tin Started scap: testwiki to group0 (previously was labtestwiki which does not work)
[14:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:31:07] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[14:32:08] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[14:32:18] ah, I'm assuming that's related to restarting restbase1007 (the mobileapps 500s)
[14:32:27] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy
[14:32:39] RECOVERY - Apache HTTP on mw1284 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.002 second response time
[14:33:08] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[14:34:37] RECOVERY - Apache HTTP on mw2243 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.074 second response time
[14:35:37] RECOVERY - Disk space on mw2243 is OK: DISK OK
[14:35:39] RECOVERY - nutcracker port on mw2243 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[14:35:58] RECOVERY - nutcracker process on mw2243 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[14:35:59] RECOVERY - nutcracker process on mw1284 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[14:35:59] RECOVERY - MD RAID on mw2243 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[14:36:07] RECOVERY - Check size of conntrack table on mw1284 is OK: OK: nf_conntrack is 0 % full
[14:36:09] RECOVERY - Disk space on mw1284 is OK: DISK OK
[14:36:37] RECOVERY - MD RAID on mw1284 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[14:36:37] RECOVERY - salt-minion processes on mw2243 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:36:48] RECOVERY - salt-minion processes on mw1284 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:36:58] RECOVERY - configured eth on mw2243 is OK: OK - interfaces up
[14:37:17] RECOVERY - Check size of conntrack table on mw2243 is OK: OK: nf_conntrack is 0 % full
[14:37:27] RECOVERY - configured eth on mw1284 is OK: OK - interfaces up
[14:37:28] RECOVERY - dhclient process on mw2243 is OK: PROCS OK: 0 processes with command name dhclient
[14:37:28] RECOVERY - DPKG on mw2243 is OK: All packages OK
[14:37:47] RECOVERY - nutcracker port on mw1284 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[14:37:57] RECOVERY - dhclient process on mw1284 is OK: PROCS OK: 0 processes with command name dhclient
[14:40:21] scap is on sync-masters
[14:41:08] RECOVERY - DPKG on mw1284 is OK: All packages OK
[14:44:20] so what was up with that mobileapps page again?
[14:44:23] and why aren't we getting paged?
[14:45:36] I'm looking into the former question with mobileapps logs + logstash, no idea about the latter though
[14:46:14] (03CR) 10Faidon Liambotis: [C: 031] ferm: Kill INTERNAL_V4/INTERNAL_V6 definitions [puppet] - 10https://gerrit.wikimedia.org/r/295332 (owner: 10Alexandros Kosiaris)
[14:47:25] !log rolling restart of aqs service on aqs1001-aqs1006 to pick up new firejail settings
[14:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:48:47] joal --^
[14:50:01] (03CR) 10Faidon Liambotis: [C: 04-1] "I still don't see how this makes sense. What does "sphere private" really mean here? Why would that combination of networks /ever/ be usef" [puppet] - 10https://gerrit.wikimedia.org/r/295333 (owner: 10Alexandros Kosiaris)
[14:50:08] thcipriani, around?
[14:51:12] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#2396678 (10Papaul)
[14:52:52] yurik_: yup, what's up?
[14:53:04] thcipriani, hey, want to do graph spec3 later today?
[14:53:12] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#2299896 (10Papaul) a:05Papaul>03Joe OS installation complete on all the hosts puppet cert and salt-key complete as well.
[14:53:56] thcipriani: good morning! scap is still going on.
I have lost time trying to figure out a very lame mistake / typo :D [14:54:10] sec patches applied for sure, yurik_ is willing to get a security patch of some sort added [14:54:29] and Roan is adding a few changes to Echo but I guess he will reach a working state by the time of deployment [14:54:55] hashar: ack, thanks for taking care of all that :) [14:55:50] yurik_: yup if you're up for it, I've got the puppet part (https://gerrit.wikimedia.org/r/#/c/294357/) if you've got the graphoid ./scap dir part :) [14:56:42] sync-apaches is 44% done [14:57:14] thcipriani, there is only one patch for swat - we could bug gehel to merge any needed puppet stuff [14:57:18] if he's around :) [14:57:33] * yurik_ goes to look at the scap dir thingy again [14:58:26] I only have one Echo patch, and I've got it lined up, I just need to wait for scap to finish [15:00:04] anomie, ostriches, thcipriani, marktraceur, and Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160621T1500). Please do the needful. [15:00:04] kart_: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:00:21] around [15:00:44] kart_: holding SWAT until scap is complete. [15:01:26] 06Operations, 10ops-codfw, 10hardware-requests: procure syslog hardware in codfw - https://phabricator.wikimedia.org/T138075#2396708 (10RobH) a:03RobH [15:01:33] 70% [15:01:38] well [15:01:42] at worst we can cancel the scap [15:01:56] the 30% remaining of scap to testwiki can be done again later [15:02:07] thcipriani: Sure [15:02:18] hashar: I'd rather let it complete [15:03:01] let's stream the progress on hangout https://hangouts.google.com/hangouts/_/wikimedia.org/scap :D [15:03:15] joining. [15:03:25] :) [15:03:27] that's the modern way of sharing a display bar [15:03:31] a progress bar [15:03:37] 80 left [15:04:24] RECOVERY - puppet last run on mw1284 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [15:05:10] 06Operations, 06Services: mobileapps 500s following reboot of restbase1007 - https://phabricator.wikimedia.org/T138314#2396741 (10fgiunchedi) [15:07:25] cdb rebuild [15:07:33] thcipriani: kart_ really sorry about the added delay :- [15:07:47] I screwed up the first scap by switching labtestwiki instead of testwiki [15:07:56] turns out it causes a fatal eventually :D [15:08:50] hashar: no worries at all. [15:09:22] hashar: hmm, weird, but probably a good thing. I like the hangout! [15:09:25] thcipriani, poke me when you want to play with the graph stuff [15:09:32] i'm getting it ready in the mean time [15:09:49] yurik_: ack, sounds good, I'll be ready post-SWAT most likely :) [15:12:16] I'd like to remind you that my change is merged but not deployed [15:12:28] (03PS1) 10Elukey: Add new MW appservers to the scap DSH list. [puppet] - 10https://gerrit.wikimedia.org/r/295353 [15:12:30] yurik_: sorry, afk atm. You need me to review / merge something? [15:12:44] RECOVERY - puppet last run on mw2243 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [15:12:56] gehel, we want to switch graphoid service to scap3, same as before [15:15:53] PROBLEM - Apache HTTP on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:16:53] PROBLEM - puppet last run on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
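The aborted scap above is reproducible by hand on the deploy host with the exact command scap logged; a sketch, with --quiet dropped so the underlying fatal is actually printed and a scratch outdir substituted:

    /usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="labtestwiki" \
        --outdir="/tmp/l10n-repro" --threads=4 --lang en
    echo "exit status: $?"   # 255 == PHP fatal; here labtestwiki's config was the culprit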
[15:16:54] PROBLEM - HHVM rendering on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:17:03] thcipriani: I dont know what is wrong with the last host [15:17:24] PROBLEM - Check size of conntrack table on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:17:34] PROBLEM - SSH on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:17:49] (03CR) 10Elukey: [C: 032] Add new MW appservers to the scap DSH list. [puppet] - 10https://gerrit.wikimedia.org/r/295353 (owner: 10Elukey) [15:17:53] PROBLEM - salt-minion processes on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:17:54] PROBLEM - configured eth on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:18:11] hashar: looks like mw1131 is having a bad time [15:18:13] PROBLEM - dhclient process on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:18:33] PROBLEM - nutcracker port on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:18:34] PROBLEM - DPKG on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:18:43] hashar: you should login in a new term and kill 26562 [15:18:45] PROBLEM - nutcracker process on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:18:54] PROBLEM - HHVM processes on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:19:08] (that's the process that's running ssh mw1131 scap cdb-rebuild) [15:19:24] should I abort ? [15:19:40] no, open a new term and just kill 26562 [15:19:56] what's the timeout on those? [15:19:57] done [15:20:13] RECOVERY - configured eth on mw1131 is OK: OK - interfaces up [15:20:14] !log hashar@tin Finished scap: testwiki to group0 (previously was labtestwiki which does not work) (duration: 51m 45s) [15:20:14] RECOVERY - dhclient process on mw1131 is OK: PROCS OK: 0 processes with command name dhclient [15:20:18] 15:19:54 sudo -u mwdeploy -n -- /usr/bin/scap cdb-rebuild on mw1131.eqiad.wmnet returned [143]: l10n merge: 0% (ok: 0; fail: 0; left: 393) [15:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:20:24] thcipriani: kart_ SWAT open [15:20:25] sorry :( [15:20:29] hashar: cool, thanks :) [15:20:34] RECOVERY - nutcracker port on mw1131 is OK: TCP OK - 0.000 second response time on port 11212 [15:20:44] RECOVERY - DPKG on mw1131 is OK: All packages OK [15:20:54] RECOVERY - nutcracker process on mw1131 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker [15:21:03] RECOVERY - HHVM processes on mw1131 is OK: PROCS OK: 6 processes with command name hhvm [15:21:14] RECOVERY - puppet last run on mw1131 is OK: OK: Puppet is currently enabled, last run 35 minutes ago with 0 failures [15:21:45] RECOVERY - Check size of conntrack table on mw1131 is OK: OK: nf_conntrack is 0 % full [15:21:54] RECOVERY - SSH on mw1131 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0) [15:22:15] RECOVERY - salt-minion processes on mw1131 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [15:23:42] this one is not a new appserver --^ [15:23:58] ah ok just read the backlog [15:24:45] thcipriani: we can start with test hosts, but you can review the patch and determine if that's needed. we don't have dblist today. [15:25:36] kart_: ack. give me 1 second to do some quick cleanup. [15:25:45] looks like scap killed mw1131 somehow [15:25:49] anyway testwiki is switched [15:25:50] all set [15:25:59] thcipriani: I am rushing out ! 
have safe deploy! [15:26:13] dapatrick might have some more patches to apply [15:26:29] hashar: thank you! [15:27:51] None for me this week, unless we have some emergency bug reports. [15:28:05] hashar ^^ [15:29:03] (03PS1) 10Faidon Liambotis: openldap: enable the memberof overlay [puppet] - 10https://gerrit.wikimedia.org/r/295357 [15:29:06] moritzm: hey [15:29:59] yurik_: I'll be available in 5-10'... [15:30:12] thx gehel ! [15:30:17] i'm getting the patches ready [15:30:37] kart_: ok, let's get started :) [15:31:54] (03PS2) 10Thcipriani: Deploy Compact Language Links as default (Stage 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295334 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry) [15:32:25] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295334 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry) [15:32:59] yurik_: I'm here, sorry for the delay [15:33:01] (03Merged) 10jenkins-bot: Deploy Compact Language Links as default (Stage 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295334 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry) [15:33:26] thcipriani: sure [15:33:43] gehel, no worries, still taking a few minutes to set up the scap3 patches for graphoid. Also, i think thcipriani has created some puppet patch for graphoid [15:34:25] jynus: want me to sync Repool db1068 with low weight; depool db1061 and db1062 ? [15:35:44] PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: puppet fail [15:35:45] RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge. [15:38:30] kart_: change is on mw1017 now [15:39:05] thcipriani: checking. [15:39:33] thcipriani: I need to enable x-mw-debug and test on specific WP we deployed, right? [15:39:42] Looks good with en.wikivoyage. [15:39:48] kart_: yup. [15:39:53] spot-check with mwrepl looks ok [15:40:14] RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:40:32] kart_: lemme know when you're ready for me to sync everywhere [15:40:40] thcipriani, either you do it or I can, when there is a hole [15:40:54] PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: puppet fail [15:41:14] thcipriani: few seconds more, checking negative test. [15:41:19] thcipriani, https://gerrit.wikimedia.org/r/295358 [15:41:53] jynus: syncing now [15:42:16] !log thcipriani@tin Synchronized wmf-config/db-eqiad.php: Repool db1068 with low weight; depool db1061 and db1062 (duration: 00m 30s) [15:42:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:42:41] thank you, I would say "checking", but I do that all the time [15:42:54] RECOVERY - mediawiki-installation DSH group on mw1284 is OK: OK [15:43:12] thcipriani: go ahead. [15:43:24] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. [15:44:35] kart_: doing. [15:44:50] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:295334|Deploy Compact Language Links as default (Stage 1)]] (duration: 00m 25s) [15:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:44:54] ^ kart_ check please [15:46:52] thcipriani: thanks. Checking. 
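A note on the mw1131 hang during the sync above: the trick was to kill only the stuck per-host worker from a second terminal on the deploy host, so the rest of the run continues. A sketch of that recipe, with the host and PID from this log:

    pgrep -af 'ssh mw1131'   # find the per-host worker, e.g. PID 26562
    kill 26562               # SIGTERM just that worker; scap records exit
                             # [143] for mw1131 and moves on with the rest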
[15:47:46] thcipriani, i do want to deploy it to beta cluster, but i'm not sure the servers there are properly setup yet [15:48:17] basically deployment-graphoid.deployment-prep.eqiad.wmflabs does not exist yet :) [15:48:37] yurik_: heh, yeah, that sounds like a blocker :P [15:49:28] thcipriani, ok, merging. gehel, go ahead and enable it, if thcipriani allows? [15:49:38] s/allows/ok with it :) [15:49:56] yurik_: remind me of the context... [15:50:07] gehel, scap3 for graphoid service [15:50:32] gehel, https://gerrit.wikimedia.org/r/#/c/294357 [15:50:41] gehel: so puppet would need...yeah ^ that [15:50:54] ok, looking... [15:51:28] thcipriani: sorry, looks good. [15:51:36] kart_: np, thanks for checking :) [15:51:53] thcipriani: thanks a lot. x-mw-debug thing must be used for all. [15:52:36] yurik_: before puppet runs on the graphoid hosts you have to do: scap deploy --init (after your /deploy patch merges and is on tin) [15:53:19] !log catrope@tin Synchronized php-1.28.0-wmf.7/extensions/Echo/: (no message) (duration: 00m 33s) [15:53:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:53:30] yurik_, thcipriani: lgtm, merging [15:53:52] (03PS2) 10Gehel: Deploy Graphoid with Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/294357 (owner: 10Thcipriani) [15:53:57] thcipriani, doing it now [15:55:03] thcipriani, done [15:55:31] (03CR) 10Gehel: [C: 032] Deploy Graphoid with Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/294357 (owner: 10Thcipriani) [15:55:38] cool, should be good to run puppet on graphoid targets when ^ merges [15:57:12] thcipriani, yurik_: I seem to remember needing a puppet run on tin as well (where are my notes when I need them?) [15:57:37] gehel: yup, a puppet run on tin first for housekeeping [15:58:18] !log puppet run on tin to enable scap3 deployment for graphoid [15:58:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:58:31] thcipriani: still working on mw1131 [15:58:33] ? [15:59:04] elukey: no, unless RoanKattouw has any more Echo stuff, SWAT should be complete [15:59:17] I synced my one Echo thing [15:59:19] so I'm done [16:00:04] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160621T1600). Please do the needful. [16:00:44] thcipriani: ah ok because I can still see a critical for hhvm [16:00:48] will restart it [16:01:10] thcipriani, yurik_: puppet run on tin complete [16:02:04] gehel: ack, thanks. can you run on graphoid nodes too please? [16:03:11] thcipriani: scb[12]00[12] ? [16:04:02] gehel: yup that looks correct based on yurik_ 's patch [16:04:15] thcipriani: running right now... [16:05:05] done [16:05:35] gehel: thank you! [16:05:46] thcipriani: at your service... [16:06:09] yurik_: could you spot-check scb2001 to make sure that /srv/deployment/graphoid is owned by deploy-service? 
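To keep the scap3 bootstrap ordering from this exchange in one place, a condensed sketch; the deploy-repo path on tin is an assumption, the hostnames are the ones named here:

    # 1. merge the service's /deploy repo patch so it is on tin, then:
    cd /srv/deployment/graphoid/deploy   # assumed path, for illustration
    scap deploy --init                   # before puppet touches the targets
    # 2. puppet run on tin (housekeeping), then on the targets
    #    (scb1001, scb1002, scb2001, scb2002)
    # 3. first real deploy, verbose:
    scap deploy -v

As described just below, scap then treats scb2001.codfw.wmnet as the canary: deploy there, restart the service, check that port 19000 accepts connections (nc -z scb2001.codfw.wmnet 19000 is the quick manual equivalent), and prompt before continuing to the other hosts.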
[16:06:17] RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures [16:06:32] * yurik_ looks [16:07:13] thcipriani: checked on scb1001, looks good [16:07:28] gehel: cool, thanks :) [16:07:36] thcipriani, yep [16:07:40] hehe [16:07:54] actually, looks good on all 4 servers [16:07:54] yurik_: feel free to pull the trigger: scap deploy -v [16:08:02] * thcipriani watches [16:08:12] * gehel crosses fingers [16:09:20] should deploy to scb2001.codfw.wmnet, restart service there, check 19000 is accepting connections, prompt to continue, then do the others [16:10:17] RECOVERY - mediawiki-installation DSH group on mw2243 is OK: OK [16:12:01] * yurik_ tries graphoid scap3 [16:16:15] finishing syncing [16:16:27] RECOVERY - mediawiki-installation DSH group on mw1285 is OK: OK [16:16:27] RECOVERY - mediawiki-installation DSH group on mw1286 is OK: OK [16:16:40] yurik_: nice :) [16:17:21] gehel: yurik_ thanks again for all the work to port to scap3, much obliged. [16:17:45] thanks thcipriani ! [16:17:47] * gehel did not do much... but gained a bit of knowledge in the process... [16:18:27] RECOVERY - mediawiki-installation DSH group on mw1274 is OK: OK [16:19:38] RECOVERY - mediawiki-installation DSH group on mw1283 is OK: OK [16:24:53] (03PS5) 10Gehel: Configuration for new elasticsearch servers in eqiad. [puppet] - 10https://gerrit.wikimedia.org/r/294918 [16:31:10] (03CR) 10Gehel: [C: 032] Configuration for new elasticsearch servers in eqiad. [puppet] - 10https://gerrit.wikimedia.org/r/294918 (owner: 10Gehel) [16:32:05] !log starting installation of new elasticsearch server elastic1032.eqiad.wmnet [16:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:33:05] !log deployed and restarted graphoid with scap3 [16:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:35:18] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [16:38:13] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, good job!" [puppet] - 10https://gerrit.wikimedia.org/r/291819 (owner: 10Alexandros Kosiaris) [16:52:25] (03CR) 10Filippo Giunchedi: [C: 031] "two small nits in comments, LGTM otherwise, also verified via PCC for https://gerrit.wikimedia.org/r/#/c/291819/ https://puppet-compiler.w" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [16:53:52] (03PS1) 10Urbanecm: Temporary IP Cap Lift on es.wiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295365 (https://phabricator.wikimedia.org/T138322) [16:56:27] Hi, can somebody deploy https://gerrit.wikimedia.org/r/#/c/295365/ for eswiki? This is a throttle rule for an event that's held today. See T138322 for details. [16:56:27] T138322: Temporary IP Cap Lift on es.wiki and commons - https://phabricator.wikimedia.org/T138322 [16:58:33] Ping: thcipriani Krenair anomie [16:59:54] * thcipriani looks [17:00:04] yurik, gwicke, cscott, arlolra, and subbu: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160621T1700). [17:00:05] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, 03Discovery-Search-Sprint: Install and configure new elasticsearch servers in eqiad - https://phabricator.wikimedia.org/T138329#2397068 (10Gehel) [17:00:38] no deploy today.
[17:00:51] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, 03Discovery-Search-Sprint: Install and configure new elasticsearch servers in eqiad - https://phabricator.wikimedia.org/T138329#2397068 (10Gehel) Configuration of new servers was done in https://gerrit.wikimedia.org/r/#/c/294918/ (so... [17:01:40] (03PS32) 10Filippo Giunchedi: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [17:01:43] (03PS1) 10Filippo Giunchedi: syslog: limit source range to $PRODUCTION_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/295368 [17:02:07] subbu: Why? This can't wait for a usual deploy window because I couldn't get a time slot from eswiki to schedule it. The event (which needs one of the throttle rules I've added) is held today at 13:00 UTC-5. It can't be done in a later SWAT (and mind that I'm in Europe). [17:02:13] 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, 03Discovery-Search-Sprint: Install and configure new elasticsearch servers in eqiad - https://phabricator.wikimedia.org/T138329#2397100 (10Gehel) elastic1032 is installed and configured. It joined the cluster without issues and is st... [17:02:38] Urbanecm, sorry .. I meant: we aren't deploying parsoid today. [17:02:56] (03CR) 10Thcipriani: [C: 032] Temporary IP Cap Lift on es.wiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295365 (https://phabricator.wikimedia.org/T138322) (owner: 10Urbanecm) [17:03:43] (03Merged) 10jenkins-bot: Temporary IP Cap Lift on es.wiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295365 (https://phabricator.wikimedia.org/T138322) (owner: 10Urbanecm) [17:03:44] subbu: Ok :) [17:04:02] Thanks for the explanation [17:04:37] (03CR) 10Alexandros Kosiaris: network: add $production_networks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [17:05:01] deploying graphoid and tilerator [17:06:47] !log thcipriani@tin Synchronized wmf-config/throttle.php: [[gerrit:295365|Temporary IP Cap Lift on es.wiki and commons]] (duration: 00m 24s) [17:06:55] ^ Urbanecm [17:07:12] Thanks for your deploy thcipriani :) [17:07:26] Urbanecm: thanks for keeping up with these on short notice :) [17:07:47] You're welcome :).
[17:08:05] akosiaris, godog: so, I think we should just redefine INTERNAL for these purposes [17:08:10] and not introduce production_networks [17:08:47] PROBLEM - Elasticsearch HTTPS on elastic1032 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused [17:09:05] (03CR) 10Alexandros Kosiaris: [C: 031] Postgresql: init database with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295343 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel) [17:10:27] (03PS33) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [17:10:30] paravoid: yeah that's probably better, no realm in the variable [17:10:48] !log deployed graphoid https://gerrit.wikimedia.org/r/#/c/295367/ [17:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:11:22] and set $internal_networks to slice_network_constants($::realm) [17:12:33] paravoid: I 'd like to kill INTERNAL tbh [17:12:51] it's badly named and easily misunderstood/misinterpreted [17:13:17] REALM_NETWORKS ? [17:13:21] well, ok, that's true [17:13:27] but production_networks won't work with labs [17:13:53] labs ? as in labs VMs ? [17:13:58] so having a realm-dependent variable which means "internal to our network, not accessible from the internet" would be useful [17:14:11] in Labs instances that happen to use ops/puppet code to set up something [17:16:05] yeah I agree internal is a bit misleading, realm_networks might do [17:16:43] so, we have various needs. For example we do want a PRODUCTION_NETWORKS and a LABS_NETWORKS structure in production [17:16:51] !log thcipriani@tin Synchronized php-1.28.0-wmf.7/extensions/Graph/lib/graph2.compiled.js: pre-train backport: [[gerrit:295366|Updated to latest graph2 lib]] (duration: 00m 31s) [17:16:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:16:56] the LABS_NETWORKS used mostly in labstores and labsdbs [17:17:20] and anything else we might want to share infra [17:18:07] so in that case REALM_NETWORKS == PRODUCTION_NETWORKS but we do need something extra as well. At which level exactly I am not sure [17:18:43] but it probably does make sense to do it someplace somewhat central [17:18:56] yeah, for cross-realm ferm rules in production we could do REALM_NETWORKS + LABS_NETWORKS [17:19:24] do we have the reverse need in labs ? [17:19:25] I mean the labstore/labdb case is sort of production by definition as I see it [17:19:40] I don't think so, but please do correct me if I am wrong [17:19:49] I hope we are not going to access/rely on labs instances from production but yeah there could be exceptions [17:20:02] not from the internal network anyway [17:20:29] ah, yes there are. CI [17:20:45] (03PS1) 10Gehel: Adding missing dependency in exposing puppet SSL certs on elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/295369 (https://phabricator.wikimedia.org/T138329) [17:21:01] well... CI from private ? or public. right now it's public, but contint1001 will not be [17:22:02] mhhh CI running in production reaching out to labs instances via the non-public labs addresses? [17:22:57] I think so at least. hashar is reworking the CI architecture these days [17:23:21] we actually have some open questions about the new CI architecture: https://phabricator.wikimedia.org/T133300 [17:23:22] good timing at least [17:24:31] indeedly!
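For concreteness on the naming debate above: whatever the lists end up being called, a cross-realm rule is just their union on the ferm side. A sketch with placeholder CIDRs (not the real allocations), using ferm's dry-run mode so nothing is applied:

    cat > /tmp/realm-test.conf <<'EOF'
    @def $REALM_NETWORKS = (10.64.0.0/12 208.80.152.0/22);
    @def $LABS_NETWORKS  = (10.68.16.0/21);
    domain (ip) table filter chain INPUT {
        proto tcp dport 514 saddr ($REALM_NETWORKS $LABS_NETWORKS) ACCEPT;
    }
    EOF
    ferm --noexec --lines /tmp/realm-test.conf   # print generated rules only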
[17:25:08] not clear if we're going to be using contint1001 since it can't reach labs instances from its network(?) [17:26:30] (03CR) 10Andrew Bogott: "Can we please just bump this to 10,000 and then forget about it for another year or two? The migration from opendj to openldap applied a " [puppet] - 10https://gerrit.wikimedia.org/r/295198 (https://phabricator.wikimedia.org/T122595) (owner: 10Muehlenhoff) [17:29:26] afaik it shouldn't via the internal labs addresses no [17:30:27] don't quote me on that though :) [17:30:31] 06Operations, 10Traffic, 10Wikimedia-Logstash: Move logstash to an LVS service - https://phabricator.wikimedia.org/T132458#2397252 (10bd808) This might be something to look at doing as part of {T138328}. [17:31:13] I agree, labs private IPs should not be reachable from production private IPs and vice versa [17:33:23] there are servers within prod private ip space accessible from private ip space in labs, primarily labs-support vlan things and ldap servers [17:33:28] yup. so we're thinking of moving to scandium which is in the labs host network, but since we do rely on the varnish misc cache: we're not sure if we can move into that network. [17:34:19] I think that's our main open question: whether we can still be behind the varnish cache if we move to scandium. [17:35:53] 06Operations, 10ORES, 06Revision-Scoring-As-A-Service: ORES should advertise swagger specs under /?spec - https://phabricator.wikimedia.org/T137804#2397269 (10Halfak) https://github.com/wiki-ai/ores/pull/151 [17:36:56] 06Operations, 10Traffic, 10Wikimedia-Logstash: Move logstash to an LVS service - https://phabricator.wikimedia.org/T132458#2397273 (10bd808) Related: {T113104} [17:38:22] thcipriani: you want scandium on a private vlan, accessible by labs VM's, able to access labs VM's (22 only?), and able to be behind varnish for gerrit/jenkins? [17:40:28] chasemp: that is mostly my understanding. Most of my understanding comes from hasharAway so small bits of that may not be true, but I think the broad strokes are correct. [17:41:20] thcipriani: why are we putting scandium is labs-hosts1-b-eqiad? [17:41:41] in even [17:42:04] I don't understand the question [17:42:09] it's in labs-support1 btw [17:42:39] labs-hosts is mostly for openstack infra, nodepool was included there for that reason and it has the labvirts and is also the transit network for actual labs vm's [17:42:56] labs-support is generally services we consider production that provide functionality to labs vm's [17:43:01] i.e. nfs, etc [17:43:40] so labs-support seems ideologically the right place and there isn't a reason it couldn't be behind varnish, other than it may not be setup afaik [17:43:51] I don't get why promethium is in labs-hosts [17:44:34] but it's not a good misc services or misc things vlan [17:44:44] I have to go, will read the tl;dr on tasks :) [17:50:34] it seems like crossed wires [17:50:43] https://phabricator.wikimedia.org/T133300#2380886 indicates labs-support [17:50:58] https://phabricator.wikimedia.org/T133300#2382725 hashar calls out labs host network [17:51:12] these are functionally separate things in that there is an actual labs-support and labs-hosts [17:51:18] so I think there is confusion there [17:51:32] anyhoo [17:52:07] yeah, in our discussions there has definitely been some confusion about the network layout. I'm largely unfamiliar with these groupings.
[17:53:26] chasemp: would you be able to write a bit on that task to clear up the distinction between labs-host and labs-support + the varnish info you mentioned earlier? [17:54:06] yes but I have some other various questions I think and so I need to reread from the beginning [17:54:12] (03CR) 10Muehlenhoff: "As for the former comments wrt dropping INTERNAL, let's do that in a followup patch. Once this patch is merged I can review/change existin" [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis) [17:54:16] this is like day 2 after 2 weeks away so I'm out of the loop [17:54:43] you missed so much CI-related fun :) [17:56:46] paravoid: godog: https://etherpad.wikimedia.org/p/realm_networks [17:57:09] I 've put an effort to approach the problem there plus some proposed solutions. Please comment [17:57:16] and now I am off for the day [17:59:29] (03PS2) 10Gehel: Postgresql: init database with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295343 (https://phabricator.wikimedia.org/T138092) [18:01:08] (03CR) 10Gehel: [C: 032] Postgresql: init database with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295343 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel) [18:02:23] (03PS1) 10BBlack: varnish: burn more cpu/mem on better gzip compression [puppet] - 10https://gerrit.wikimedia.org/r/295372 [18:04:23] (03PS1) 10Gehel: Revert "Postgresql: init database with Puppet" [puppet] - 10https://gerrit.wikimedia.org/r/295373 [18:04:31] (03CR) 10BBlack: [C: 032] varnish: burn more cpu/mem on better gzip compression [puppet] - 10https://gerrit.wikimedia.org/r/295372 (owner: 10BBlack) [18:05:00] (03CR) 10Gehel: [C: 032] "Dependency cycle issue not detected by puppet compiler, reverting" [puppet] - 10https://gerrit.wikimedia.org/r/295373 (owner: 10Gehel) [18:05:12] (03CR) 10Gehel: [V: 032] "Dependency cycle issue not detected by puppet compiler, reverting" [puppet] - 10https://gerrit.wikimedia.org/r/295373 (owner: 10Gehel) [18:06:07] (03PS2) 10Gehel: Revert "Postgresql: init database with Puppet" [puppet] - 10https://gerrit.wikimedia.org/r/295373 [18:06:23] (03CR) 10Gehel: [V: 032] Revert "Postgresql: init database with Puppet" [puppet] - 10https://gerrit.wikimedia.org/r/295373 (owner: 10Gehel) [18:23:47] (03PS1) 10Ori.livneh: Optimize mobile static images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295374 [18:35:29] 07Blocked-on-Operations, 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: adywiki is missing the associated adywiki_p database with appropriate views - https://phabricator.wikimedia.org/T135029#2286195 (10Gehel) [18:36:07] (03CR) 10BBlack: [C: 031] "+1 for zopfli awesomeness :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295374 (owner: 10Ori.livneh) [18:37:25] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 678 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5206350 keys - replication_delay is 678 [18:37:28] (03PS1) 10BBlack: caches: tcp_notsent_lowat => 128K [puppet] - 10https://gerrit.wikimedia.org/r/295376 [18:39:34] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5150037 keys - replication_delay is 0 [18:39:59] (03CR) 10BBlack: [C: 032] caches: tcp_notsent_lowat => 128K [puppet] - 10https://gerrit.wikimedia.org/r/295376 (owner: 10BBlack) [18:41:44] PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: puppet fail [18:48:14] RECOVERY - puppet last run on cp3044 is
OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures [18:50:55] !log enabled tcp_notsent_lowat optimization on all caches (marking this time for investigation of perf graphs later) - https://gerrit.wikimedia.org/r/#/c/295376/ [18:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:56:13] 07Blocked-on-Operations, 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: adywiki and jamwiki are missing the associated *_p databases with appropriate views - https://phabricator.wikimedia.org/T135029#2397393 (10MaxSem) [19:00:04] hashar: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160621T1900). [19:02:32] train time! [19:02:54] ____ [19:02:54] _||__| | ______ ______ ______ [19:02:54] ( | | | | | | | [19:02:56] /-()---() ~ ()--() ~ ()--() ~ ()--() [19:06:16] 06Operations, 10ops-codfw, 10hardware-requests: procure syslog hardware in codfw - https://phabricator.wikimedia.org/T138075#2397399 (10RobH) So all of the spare hardware currently in codfw far exceeds that of lithium.eqiad.wmnet. lithium: lithium is a Central syslog server (role::syslog::centralserver) Sin... [19:06:23] :o [19:08:07] (03PS1) 10Thcipriani: Group0 to 1.28.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295380 [19:11:50] (03CR) 10Thcipriani: [C: 032] Group0 to 1.28.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295380 (owner: 10Thcipriani) [19:12:27] (03Merged) 10jenkins-bot: Group0 to 1.28.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295380 (owner: 10Thcipriani) [19:14:12] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.28.0-wmf.7 [19:14:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:14:25] PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.122:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.0.122, port=9200): Read timed out. (read timeout=4) [19:17:35] logstash1001 OOMed :/ [19:17:43] !log Restarted ElasticSearch on logstash1001; dead from OOM [19:17:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:18:13] I think I caused it dcausse. I was pointing a Kibana4 instance at it for testing [19:18:19] !log thcipriani@tin Purged l10n cache for 1.28.0-wmf.5 [19:18:34] RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 49, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 147, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_sh [19:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:18:50] bd808: you mean queries from kibana4 killed elastic? [19:19:01] it looks like it, yes [19:19:06] doh... :/ [19:20:11] I haven't played with kibana4 for about a year but it seems to be just as gross as I remembered [19:22:54] PROBLEM - logstash process on logstash1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash [19:23:12] yuck. what's busted now? 
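For the record, the usual way to answer that question on a systemd host, plus the knob that makes a unit respawn its process itself; a sketch, assuming the unit is simply named logstash:

    systemctl status logstash      # why/when the process died
    journalctl -u logstash -n 50   # respawn attempts and their errors

    # If the unit lacks a restart policy, add one via a drop-in override
    # (written by hand so it also works on older systemd versions without
    # 'systemctl edit'):
    mkdir -p /etc/systemd/system/logstash.service.d
    printf '[Service]\nRestart=on-failure\nRestartSec=10\n' \
        > /etc/systemd/system/logstash.service.d/restart.conf
    systemctl daemon-reload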
[19:25:03] RECOVERY - logstash process on logstash1001 is OK: PROCS OK: 1 process with UID = 998 (logstash), command name java, args logstash [19:25:33] bd808: logstash stops by itself if it fails too many times on elastic? [19:26:16] dcausse: apparently. And it looks like our systemd script for it doesn't start it back up [19:26:23] I thought we had fixed that [19:27:06] !log Restarted dead logstash process on logstash1001. Looks to have stopped itself due to the Elasticsearch OOM earlier [19:27:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:31:33] ps -edf | grep java [19:32:04] PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master). [19:32:17] (03CR) 10Tim Landscheidt: "My most common use case is to look up the shell user name for a given wiki user name ("Tim Landscheidt"/"Tim_Landscheidt" => "scfc"). I c" [puppet] - 10https://gerrit.wikimedia.org/r/295198 (https://phabricator.wikimedia.org/T122595) (owner: 10Muehlenhoff) [19:33:06] logstash is in status "active, running"? [19:33:23] it takes a bit to complete its startup, though [19:34:28] moritzm: when the icinga alert went off it was "Active: inactive (dead) since Tue 2016-06-21 19:17:30 UTC; 6min ago" [19:34:45] so I did service stop && service start on it [19:35:20] ok, so apparently it tried to restart itself, but that failed with an I/O exception [19:35:26] (03PS1) 10BBlack: stream.wm.o: drop all DNS TTLs to 5m [dns] - 10https://gerrit.wikimedia.org/r/295384 (https://phabricator.wikimedia.org/T134871) [19:35:28] (03PS1) 10BBlack: stream.wm.o: move to cache_misc in DNS [dns] - 10https://gerrit.wikimedia.org/r/295385 (https://phabricator.wikimedia.org/T134871) [19:35:57] at least it logs multiple org.apache.http.impl.execchain.RetryExec execute log lines in journalctl [19:36:36] so it seems systemd correctly tried to respawn, but that failed [19:36:41] (03CR) 10BBlack: [C: 032] stream.wm.o: drop all DNS TTLs to 5m [dns] - 10https://gerrit.wikimedia.org/r/295384 (https://phabricator.wikimedia.org/T134871) (owner: 10BBlack) [19:39:19] I'd say let's make a task of it, it does sound like a bug [19:40:05] I'll write one up [19:47:43] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [19:48:30] (03CR) 10BBlack: "Just being pedantic, but something seems off with the %diff calculations. How can a file's size be reduced by more than 100%?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295374 (owner: 10Ori.livneh) [19:52:06] moritzm: filed as T138345 [19:52:07] T138345: Systemd unit did not restart logstash process that died for Elasticsearch connection failures - https://phabricator.wikimedia.org/T138345 [19:57:04] (03CR) 10Platonides: "That's a very good point, BBlack."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/295374 (owner: 10Ori.livneh) [19:57:20] * Platonides is pedantic, too [20:00:20] (03CR) 10Ori.livneh: "I made two mistakes: I calculated percent difference instead of percent change, and I expressed percent difference as a negative number, w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295374 (owner: 10Ori.livneh) [20:01:11] just don't tell anyone [20:06:03] PROBLEM - puppet last run on elastic2001 is CRITICAL: CRITICAL: puppet fail [20:12:28] (03PS2) 10Ori.livneh: Optimize mobile static images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295374 [20:13:19] Platonides: better? [20:13:52] heh [20:14:25] what was percent difference? [20:17:49] percent difference is a way of expressing the difference between two values when there is no direction of change (before/after), just two values that mean the same thing [20:17:58] it's 100 * |a - b| / ((a + b) * 2) [20:18:36] it's not useful in this case [20:18:55] not at all [20:19:11] (and still, the numbers don't match :P) [20:19:13] nope, whoever thought so is a careless idiot [20:19:32] * Platonides gives ori a ^ [20:22:07] / 2, not * 2 [20:22:53] to sum: i used the wrong metric, presented it in the wrong way, and then defined it incorrectly [20:22:55] ah [20:22:57] lol [20:23:11] sorry ori, you can't be perfect everyday ;) [20:23:29] only 104% of the time [20:23:44] thanks for pointing it out :) /me lunches [20:24:07] it was bblack who spotted it [20:24:13] I then got intrigued about it [20:26:07] https://vimeo.com/4435893 [20:30:54] PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: puppet fail [20:32:03] RECOVERY - puppet last run on elastic2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:56:52] RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [21:08:07] PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] [21:16:50] 06Operations, 06Release-Engineering-Team, 07Developer-notice, 05Gitblit-Deprecate, and 2 others: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2397757 (10greg) email sent. 
The countdown begins :) [21:18:25] 07Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 05Gerrit-Migration, and 2 others: Package xhpast (libphutil) - https://phabricator.wikimedia.org/T137770#2397758 (10mmodell) 05Open>03Resolved [21:25:52] 06Operations, 06Release-Engineering-Team, 07Developer-notice, 05Gitblit-Deprecate, and 2 others: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2397765 (10Paladox) @greg thanks :) [21:31:38] 07Blocked-on-Operations, 05Continuous-Integration-Scaling, 13Patch-For-Review, 07WorkType-NewFunctionality: Attempt to provide a Trusty image for Nodepool - https://phabricator.wikimedia.org/T133203#2397768 (10greg) [21:35:30] 06Operations, 10Continuous-Integration-Infrastructure, 10Nodepool, 10Phabricator, and 3 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2397825 (10greg) [21:37:02] RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [21:52:18] (03PS1) 10Smalyshev: Prepare scap3 deployment for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/295437 (https://phabricator.wikimedia.org/T129144) [21:53:35] (03CR) 10jenkins-bot: [V: 04-1] Prepare scap3 deployment for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/295437 (https://phabricator.wikimedia.org/T129144) (owner: 10Smalyshev) [21:56:00] (03PS3) 10Ori.livneh: Optimize mobile static images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295374 [21:56:17] (03CR) 10Ori.livneh: [C: 032] Optimize mobile static images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295374 (owner: 10Ori.livneh) [21:56:53] (03Merged) 10jenkins-bot: Optimize mobile static images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295374 (owner: 10Ori.livneh) [21:58:48] 07Blocked-on-Operations, 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: adywiki and jamwiki are missing the associated *_p databases with appropriate views - https://phabricator.wikimedia.org/T135029#2397893 (10Gehel) Summary of a discussion with @ori: The maintain-replicas script creates a new sch... [22:00:08] (03PS2) 10Smalyshev: Prepare scap3 deployment for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/295437 (https://phabricator.wikimedia.org/T129144) [22:00:55] 06Operations: setup syslog server in codfw - https://phabricator.wikimedia.org/T138073#2397902 (10RobH) [22:00:57] 06Operations, 10ops-codfw, 10hardware-requests: procure syslog hardware in codfw - https://phabricator.wikimedia.org/T138075#2397898 (10RobH) 05Open>03stalled I'm stalling this task for #procurement T138353. I'll gather pricing info on that task, and present the various options for review. [22:01:18] (03CR) 10jenkins-bot: [V: 04-1] Prepare scap3 deployment for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/295437 (https://phabricator.wikimedia.org/T129144) (owner: 10Smalyshev) [22:06:56] (03PS3) 10Smalyshev: Prepare scap3 deployment for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/295437 (https://phabricator.wikimedia.org/T129144) [22:08:31] !log ori@tin Synchronized static/images/mobile: I8f09e825: Optimize mobile static images (duration: 00m 34s) [22:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:09:22] RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge. 
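To pin down the formulas from the %diff exchange above: percent difference is symmetric and undirected, while a before/after file-size comparison calls for percent change, which is why a reduction of "more than 100%" was a red flag:

    % percent difference: symmetric in a and b, always non-negative
    \[ \%\,\text{diff} = 100 \cdot \frac{|a - b|}{(a + b)/2} \]
    % percent change from before (a) to after (b); negative means shrinkage,
    % and it cannot go below -100% while b >= 0
    \[ \%\,\text{change} = 100 \cdot \frac{b - a}{a} \]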
[22:11:23] 06Operations, 06Discovery, 10Elasticsearch, 03Discovery-Search-Sprint, 13Patch-For-Review: Install and configure new elasticsearch servers in eqiad - https://phabricator.wikimedia.org/T138329#2397914 (10EBernhardson) [22:14:15] 06Operations, 06Discovery, 10Elasticsearch, 10Wikimedia-Logstash, 03Discovery-Search-Sprint: Logstash elasticsearch mapping does not allow err.code to be a string - https://phabricator.wikimedia.org/T137400#2397921 (10EBernhardson) [22:14:16] (03CR) 10BBlack: [C: 031] tlsproxy: enable client/server TFO support in the kernel [puppet] - 10https://gerrit.wikimedia.org/r/295331 (https://phabricator.wikimedia.org/T108827) (owner: 10Ema) [22:19:44] !log Backfilled missing 2016-06-20 data to https://tools.wmflabs.org/sal/production?d=2016-06-20 [22:19:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:23:02] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 648 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5149571 keys - replication_delay is 648 [22:34:02] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5147613 keys - replication_delay is 0 [22:37:56] 07Blocked-on-Operations, 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: adywiki and jamwiki are missing the associated *_p databases with appropriate views - https://phabricator.wikimedia.org/T135029#2397950 (10Krenair) I agree, although it needs to be a separate ticket and I don't think we can just... [23:00:04] RoanKattouw, ostriches, Krenair, MaxSem, and Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160621T2300). [23:00:52] Hello. [23:01:03] There is nothing to SWAT this night. [23:01:20] evening [23:01:23] Dereckson, i would like to swat my own services [23:01:26] (03CR) 10Thcipriani: "Looks good, also need to remove wdqs/wdqs from hieradata/common/role/deployment.yaml and that should be it!" [puppet] - 10https://gerrit.wikimedia.org/r/295437 (https://phabricator.wikimedia.org/T129144) (owner: 10Smalyshev) [23:01:28] i had some trouble today [23:01:55] building services, because it turns out that the wonderful russian firewall is blocking AWS! [23:02:05] and i couldn't build the depl packages :( [23:02:12] Annoying. [23:02:14] You can probably request VPN access [23:02:15] * yurik is not frustrated... [23:02:22] 5+ hours wasted [23:02:36] Krenair, i finally did [23:02:48] it's figuring out that i am being blocked that took some time! [23:03:07] because a minor script deep inside the build system by a 3rd party was failing :( [23:03:21] and it was falling back onto the local build, which was also not working [23:03:35] bleh, anyway, if no one is deploying, i will deploy kartotherian & tilerator [23:03:46] unless there are some objections [23:05:51] !log deleted localuser rows for Mahir256@orwikisource and A879071@enwiki for T119736 [23:05:52] T119736: Could not find local user data for {Username}@{wiki} - https://phabricator.wikimedia.org/T119736 [23:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:06:11] (03PS1) 10EBernhardson: Duplicate logstash output to alternate elasticsearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/295442 [23:06:21] Well I'm certainly not objecting.
I'm not really familiar with those services anyway although I think the second one is maps-related [23:07:36] Krenair, they both are ;) [23:07:52] * yurik loves it when users are being deleted from the db by hand [23:08:21] especially because the wonderful SQL 'DELETE blah' without a WHERE means delete everything [23:08:42] yeah, well... I'm sure the data makes more sense after it's done than before [23:09:20] and yeah, those DELETE .. WHERE clauses are not something you'd want to screw up on production master DB servers, that's for sure :) [23:11:00] Hi tgr. That reminds me, I have another funny issue: a sessionfailure message when I tried to mark a hidden Flow diff as patrolled: ?title=Topic:...&action=markpatrolled&rcid=... Do you think that's an AuthManager issue or only Flow? [23:12:31] (03CR) 10BryanDavis: [C: 031] "LGTM. Should test via cherry-pick on deployment-puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/295442 (owner: 10EBernhardson) [23:13:43] * yurik loves the new scap3! [23:13:53] (03PS1) 10BBlack: cache_upload: experiment with higher fe hfp cutoff [puppet] - 10https://gerrit.wikimedia.org/r/295443 [23:14:15] !log updated/restarted kartotherian & tilerator - https://gerrit.wikimedia.org/r/#/c/295440/ https://gerrit.wikimedia.org/r/#/c/295441/ [23:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:14:39] (03CR) 10BBlack: [C: 032 V: 032] cache_upload: experiment with higher fe hfp cutoff [puppet] - 10https://gerrit.wikimedia.org/r/295443 (owner: 10BBlack) [23:28:00] (03PS2) 10EBernhardson: Duplicate logstash output to alternate elasticsearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/295442 [23:33:16] 06Operations, 06Community-Liaisons, 10Wikimedia-Mailing-lists: mailman maint window 2016-06-21 16:00 - 18:00 UTC - https://phabricator.wikimedia.org/T138228#2398047 (10RobH) I neglected to update this task yesterday. So the maint window is delayed until AFTER wikimania. There is an ongoing discussion with... [23:35:40] (03PS3) 10EBernhardson: Duplicate logstash output to alternate elasticsearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/295442
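A closing note on the localuser deletions and the bare-DELETE quip above: the defensive pattern for hand-edits on a production master is to SELECT first, DELETE inside a transaction, verify the affected-row count, and only then COMMIT. A sketch; the host alias is hypothetical and the table/columns are per CentralAuth's localuser schema:

    mysql -h centralauth-master.example centralauth <<'SQL'
    BEGIN;
    SELECT lu_name, lu_wiki FROM localuser
     WHERE lu_name = 'Mahir256' AND lu_wiki = 'orwikisource';
    DELETE FROM localuser
     WHERE lu_name = 'Mahir256' AND lu_wiki = 'orwikisource';
    SELECT ROW_COUNT();  -- expect exactly 1; run this interactively so a
    -- wrong count can be rolled back instead of committed
    COMMIT;
    SQL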