[00:17:04] <icinga-wm>	 RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[01:09:45] <icinga-wm>	 PROBLEM - puppet last run on cp3037 is CRITICAL: CRITICAL: puppet fail
[01:36:42] <icinga-wm>	 RECOVERY - puppet last run on cp3037 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[01:50:13] <icinga-wm>	 PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[02:04:19] <logmsgbot>	 !log l10nupdate@tin LocalisationUpdate failed (1.28.0-wmf.6) at 2016-06-21 02:04:19+00:00
[02:04:26] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:10:55] <logmsgbot>	 !log l10nupdate@tin ResourceLoader cache refresh completed at Tue Jun 21 02:10:55 UTC 2016 (duration 6m 36s)
[02:11:00] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:02:50] <wikibugs>	 06Operations, 06Release-Engineering-Team, 07Developer-notice, 05Gitblit-Deprecate, and 2 others: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2395547 (10Dzahn) @greg i was just going to do it during Wikimania.....
[03:09:47] <wikibugs>	 06Operations, 10DBA: m1-master switch from db1001 to db1016 - https://phabricator.wikimedia.org/T106312#2395548 (10Dzahn) I can check on RT ('rt') and racktables (the other rt) on Wednesday, maybe around 18UTC but i dont worry about it since these services are just used by ops themselves.  That said, please no...
[03:11:39] <wikibugs>	 06Operations, 07Blocked-on-RelEng, 05Gitblit-Deprecate, 13Patch-For-Review: Phase out antimony.wikimedia.org (git.wikimedia.org / gitblit) - https://phabricator.wikimedia.org/T123718#2395549 (10Dzahn) @Paladox this is about decom'ing antimony after gitblit is gone. slightly different. but i will take it
[03:11:48] <wikibugs>	 06Operations, 07Blocked-on-RelEng, 05Gitblit-Deprecate, 13Patch-For-Review: Phase out antimony.wikimedia.org (git.wikimedia.org / gitblit) - https://phabricator.wikimedia.org/T123718#2395550 (10Dzahn) a:03Dzahn
[03:19:30] <icinga-wm>	 PROBLEM - Disk space on elastic1024 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 80224 MB (15% inode=99%)
[03:25:50] <icinga-wm>	 RECOVERY - Disk space on elastic1024 is OK: DISK OK
[03:30:18] <wikibugs>	 07Blocked-on-Operations, 06Operations, 10Wikidata, 10Wikimedia-Language-setup, and 2 others: Create Wikipedia Jamaican - https://phabricator.wikimedia.org/T134017#2395597 (10Dzahn) 05Open>03stalled
[03:55:27] <grrrit-wm>	 (03CR) 10Dzahn: Restart exim daily on Monday to Friday (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294929 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[03:57:34] <grrrit-wm>	 (03CR) 10Dzahn: "20after4, Chad, do you agree with this?" [puppet] - 10https://gerrit.wikimedia.org/r/295011 (owner: 10Paladox)
[04:00:46] <grrrit-wm>	 (03CR) 10Dzahn: "i see just one merged commit but on "stable" not "wmf/stable"?" [puppet] - 10https://gerrit.wikimedia.org/r/293818 (owner: 1020after4)
[04:07:24] <wikibugs>	 06Operations, 07Puppet, 13Patch-For-Review: Reconsider the aligning arrows puppet lint - https://phabricator.wikimedia.org/T137763#2378054 (10Dzahn) Please keep this check. I have fixed ALL of these across the entire repo before letting it vote. That was a lot of work and it has already been done.
[04:19:59] <icinga-wm>	 PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[04:20:11] <wikibugs>	 06Operations, 10Mail, 10OTRS: E-mail incorrectly forwarded to wm-cz OTRS e-mail - https://phabricator.wikimedia.org/T129743#2114650 (10Dzahn) What is the proposed change here?  wikimedia.cz is not controlled by WMF but by WMCZ   http://www.nic.cz/whois/?d=wikimedia.cz
[04:21:46] <wikibugs>	 06Operations, 06Labs, 10Labs-Infrastructure, 06Reading-Web-Backlog, and 2 others: https://wikitech.m.wikimedia.org/ serves wikimedia.org portal - https://phabricator.wikimedia.org/T120527#2395633 (10Dzahn)
[04:27:36] <grrrit-wm>	 (03PS1) 10Dzahn: nagios_common: delete check_http_bits command [puppet] - 10https://gerrit.wikimedia.org/r/295321 (https://phabricator.wikimedia.org/T107430) 
[04:33:42] <wikibugs>	 06Operations, 13Patch-For-Review: Remove secure.wikimedia.org - https://phabricator.wikimedia.org/T120790#1861112 (10Dzahn) I guess we should call it declined then...
[04:47:14] <grrrit-wm>	 (03PS1) 10Dzahn: drac,icinga,ipmi: do not ensure => latest [puppet] - 10https://gerrit.wikimedia.org/r/295322 (https://phabricator.wikimedia.org/T115348) 
[04:47:57] <grrrit-wm>	 (03CR) 10Dzahn: "comments on the tickets sound like we are leaning towards declining this change and want to keep the redirects indefinitely.." [puppet] - 10https://gerrit.wikimedia.org/r/257510 (https://phabricator.wikimedia.org/T120790) (owner: 10Reedy)
[05:07:55] <wikibugs>	 06Operations, 10Wikimedia-Mailing-lists: Please reset password of hackathonorganizers mailing list - https://phabricator.wikimedia.org/T137873#2382117 (10Dzahn) I ran the /var/lib/mailman/bin/change_pw command with -l hackathonorganizers and _without_ the "quiet" option, which means you should have received em...
[05:08:12] <wikibugs>	 06Operations, 10Wikimedia-Mailing-lists: Please reset password of hackathonorganizers mailing list - https://phabricator.wikimedia.org/T137873#2395651 (10Dzahn) 05Open>03Resolved a:03Dzahn
[05:16:06] <wikibugs>	 06Operations, 10ops-esams: Move cp3030+ from OE14 to OE13 in racktables - https://phabricator.wikimedia.org/T136403#2395653 (10Dzahn)
[05:34:31] <wikibugs>	 06Operations, 06Labs, 10Labs-Infrastructure: labs precise and jessie instance not accessible after provisioning - https://phabricator.wikimedia.org/T117673#2395660 (10Dzahn)
[05:35:52] <wikibugs>	 06Operations: fix up log retention on log collection/storage hosts - https://phabricator.wikimedia.org/T92839#2395662 (10Dzahn)
[05:38:37] <wikibugs>	 06Operations: fix up log retention on log collection/storage hosts - https://phabricator.wikimedia.org/T92839#1121529 (10Dzahn) also T87792  and T84618 and T114395
[05:41:30] <wikibugs>	 06Operations, 07SEO: GWT accounts - https://phabricator.wikimedia.org/T103567#2395678 (10Dzahn) 05Open>03Resolved
[05:41:55] <wikibugs>	 06Operations, 07Privacy, 07audits-data-retention: Gerrit seemingly violates data retention guidelines - https://phabricator.wikimedia.org/T114395#2395679 (10Dzahn) a:03Dzahn
[05:42:07] <wikibugs>	 06Operations, 07Privacy, 07audits-data-retention: Gerrit seemingly violates data retention guidelines - https://phabricator.wikimedia.org/T114395#1694145 (10Dzahn) p:05Normal>03High
[05:42:30] <wikibugs>	 06Operations, 06Release-Engineering-Team, 07Privacy, 07audits-data-retention: Gerrit seemingly violates data retention guidelines - https://phabricator.wikimedia.org/T114395#1694145 (10Dzahn)
[05:43:51] <wikibugs>	 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: Grafana login issue for @thiemowmde - https://phabricator.wikimedia.org/T135994#2395683 (10Dzahn)
[05:44:02] <wikibugs>	 06Operations, 10Ops-Access-Requests, 10LDAP-Access-Requests, 06WMF-NDA-Requests: Grafana login issue for @thiemowmde - https://phabricator.wikimedia.org/T135994#2318389 (10Dzahn) @Robh
[05:46:08] <wikibugs>	 06Operations, 07Graphite: Grafana login issue for @thiemowmde - https://phabricator.wikimedia.org/T135994#2395686 (10Dzahn)
[05:48:01] <icinga-wm>	 PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[05:48:58] <wikibugs>	 06Operations, 10ops-codfw, 10hardware-requests: procure syslog hardware in codfw - https://phabricator.wikimedia.org/T138075#2395687 (10Dzahn)
[05:52:33] <wikibugs>	 06Operations, 10vm-requests: eqiad/codfw: 1 VM request for prometheus - https://phabricator.wikimedia.org/T136313#2395689 (10Dzahn) a:03Dzahn
[05:58:12] <wikibugs>	 06Operations, 10Gerrit, 06Release-Engineering-Team, 07Privacy, 07audits-data-retention: Gerrit seemingly violates data retention guidelines - https://phabricator.wikimedia.org/T114395#2395690 (10greg)
[06:10:44] <wikibugs>	 06Operations, 10Traffic, 06Community-Liaisons (Jul-Sep-2016): Help contact bot owners about the end of HTTP access to the API - https://phabricator.wikimedia.org/T136674#2343854 (10Dzahn) Brandon/Sherry asked me to contact user Paulis for his bot Fkraus because he speaks German. I mailed him in German about it.
[06:17:41] <icinga-wm>	 PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[06:19:24] <wikibugs>	 06Operations, 10Mail, 10OTRS: E-mail incorrectly forwarded to wm-cz OTRS e-mail - https://phabricator.wikimedia.org/T129743#2395712 (10Urbanecm) I know. info@wikimedia.cz is an alias which forwards all incoming mails to wm-cz@wikimedia.org. But (I don't know why) the mails ends up in info-cs@wikimedia.org. S...
[06:20:05] <wikibugs>	 06Operations, 10Mail, 10OTRS: E-mail incorrectly forwarded to wm-cz OTRS e-mail - https://phabricator.wikimedia.org/T129743#2395713 (10Urbanecm) Our config is set up correctly as you can see in the example in the description above.
[06:27:22] <wikibugs>	 06Operations, 06Release-Engineering-Team, 07Developer-notice, 05Gitblit-Deprecate, and 2 others: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2395719 (10greg) Thanks @dzahn. Just trying to set a good example :)
[06:27:34] <wikibugs>	 06Operations, 06Release-Engineering-Team, 07Developer-notice, 05Gitblit-Deprecate, and 2 others: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2395720 (10Paladox) @dzahn we can create the patch's then merge them...
[06:30:21] <icinga-wm>	 PROBLEM - puppet last run on mc2007 is CRITICAL: CRITICAL: puppet fail
[06:31:10] <icinga-wm>	 PROBLEM - puppet last run on subra is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:10] <icinga-wm>	 PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:30] <icinga-wm>	 PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:52] <icinga-wm>	 PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:52] <icinga-wm>	 PROBLEM - puppet last run on cp2001 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:33:21] <icinga-wm>	 PROBLEM - puppet last run on mw1120 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:30] <wikibugs>	 06Operations, 10Mail, 10OTRS: E-mail incorrectly forwarded to wm-cz OTRS e-mail - https://phabricator.wikimedia.org/T129743#2395737 (10Matthewrbowker) OTRS looks good.  wm-cz@wikimedia.org places email into the queue chapters::wm-cz according to https://ticket.wikimedia.org/otrs/index.pl?Action=AdminSystemAd...
[06:34:40] <icinga-wm>	 PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:39:34] <wikibugs>	 06Operations, 10Mail, 10OTRS: E-mail incorrectly forwarded to wm-cz OTRS e-mail - https://phabricator.wikimedia.org/T129743#2395742 (10Dzahn) >>! In T129743#2395712, @Urbanecm wrote: > I know. info@wikimedia.cz is an alias which forwards all incoming mails to wm-cz@wikimedia.org. But (I don't know why) the m...
[06:41:21] <icinga-wm>	 RECOVERY - Apache HTTP on mw1252 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 626 bytes in 0.027 second response time
[06:41:43] <moritzm>	 !log restarted hhvm on mw1252
[06:41:45] <wikibugs>	 06Operations, 06Release-Engineering-Team, 07Developer-notice, 05Gitblit-Deprecate, and 2 others: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2395743 (10Paladox) @greg I think we can do it on the date you like...
[06:41:47] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[06:42:11] <icinga-wm>	 RECOVERY - HHVM rendering on mw1252 is OK: HTTP OK: HTTP/1.1 200 OK - 66630 bytes in 0.124 second response time
[06:42:19] <wikibugs>	 06Operations, 10Mail, 10OTRS: E-mail incorrectly forwarded to wm-cz OTRS e-mail - https://phabricator.wikimedia.org/T129743#2395744 (10Urbanecm) The mail which is in the example in the description ended up in info-cs@wikimedia.org. I can try to send a test mail to info@wikimedia.cz but I can't check where it...
[06:42:54] <grrrit-wm>	 (03CR) 1020after4: "this is actually obsolete now that we have arcanist and libphutil packaged" [puppet] - 10https://gerrit.wikimedia.org/r/293818 (owner: 1020after4)
[06:43:08] <grrrit-wm>	 (03Abandoned) 1020after4: use wmf/stable branch of arcanist and libphutil [puppet] - 10https://gerrit.wikimedia.org/r/293818 (owner: 1020after4)
[06:44:13] <wikibugs>	 06Operations, 10Mail, 10OTRS: E-mail incorrectly forwarded to wm-cz OTRS e-mail - https://phabricator.wikimedia.org/T129743#2395745 (10Urbanecm) >! In T129743#2395742, @Dzahn wrote: > I checked the exim alias file that is under control of operations but there is no wm-cz@ in there. This seems to be all handl...
[06:46:25] <wikibugs>	 06Operations, 06Release-Engineering-Team, 07Developer-notice, 05Gitblit-Deprecate, and 2 others: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2395749 (10greg) >>! In T137224#2395743, @Paladox wrote: > @greg I t...
[06:49:48] <wikibugs>	 06Operations, 10Mail, 10OTRS: E-mail incorrectly forwarded to wm-cz OTRS e-mail - https://phabricator.wikimedia.org/T129743#2395750 (10Matthewrbowker) >>! In T129743#2395744, @Urbanecm wrote: >>>! In T129743#2395737, @Matthewrbowker wrote: >> OTRS looks good.  wm-cz@wikimedia.org places email into the queue...
[06:51:06] <wikibugs>	 06Operations, 06Release-Engineering-Team, 07Developer-notice, 05Gitblit-Deprecate, and 2 others: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2395751 (10greg) Scheduled: https://wikitech.wikimedia.org/wiki/Depl...
[06:51:09] <wikibugs>	 06Operations, 10Mail, 10OTRS: E-mail incorrectly forwarded to wm-cz OTRS e-mail - https://phabricator.wikimedia.org/T129743#2395752 (10Urbanecm) @Matthewrbowker I'll find one.
[06:54:33] <wikibugs>	 06Operations, 10Mail, 10OTRS: E-mail incorrectly forwarded to wm-cz OTRS e-mail - https://phabricator.wikimedia.org/T129743#2395753 (10Matthewrbowker) 05Open>03Resolved a:03Matthewrbowker I found it.  OTRS uses inbound email addresses to sort email.  So I added another filter for info@wikimedia.cz (htt...
[06:56:40] <icinga-wm>	 RECOVERY - puppet last run on subra is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:41] <icinga-wm>	 RECOVERY - puppet last run on mw1120 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:32] <icinga-wm>	 RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[06:57:51] <icinga-wm>	 RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:01] <icinga-wm>	 RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:02] <icinga-wm>	 RECOVERY - puppet last run on mc2007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:11] <icinga-wm>	 PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[06:58:21] <icinga-wm>	 RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:22] <icinga-wm>	 RECOVERY - puppet last run on cp2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:04] <wikibugs>	 06Operations, 10Mail, 10OTRS: E-mail incorrectly forwarded to wm-cz OTRS e-mail - https://phabricator.wikimedia.org/T129743#2395756 (10Urbanecm) Thanks!
[07:14:20] <icinga-wm>	 PROBLEM - Disk space on elastic1024 is CRITICAL: DISK CRITICAL - free space: /var/lib/elasticsearch 79972 MB (15% inode=99%)
[07:15:00] <grrrit-wm>	 (03CR) 10Muehlenhoff: "Looks good, but let's first land the config change for Jenkins." [puppet] - 10https://gerrit.wikimedia.org/r/295255 (https://phabricator.wikimedia.org/T80385) (owner: 10Hashar)
[07:15:28] <wikibugs>	 06Operations, 06Release-Engineering-Team, 07Developer-notice, 05Gitblit-Deprecate, and 2 others: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2395760 (10Paladox) @greg thanks and sorry.
[07:18:40] <icinga-wm>	 RECOVERY - Disk space on elastic1024 is OK: DISK OK
[07:20:31] <icinga-wm>	 PROBLEM - puppet last run on mw2084 is CRITICAL: CRITICAL: puppet fail
[07:20:45] <wikibugs>	 06Operations, 10DBA: m1-master switch from db1001 to db1016 - https://phabricator.wikimedia.org/T106312#2395761 (10akosiaris) >>! In T106312#2394393, @jcrespo wrote: > @Akosiaris @MoritzMuehlenhoff @fgiunchedi @demon @Krenair @Dzahn I intend to perform the failover on Wednesday 22, 16:00 UTC.  >  > I do not re...
[07:27:09] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032] nagios_common: delete check_http_bits command [puppet] - 10https://gerrit.wikimedia.org/r/295321 (https://phabricator.wikimedia.org/T107430) (owner: 10Dzahn)
[07:34:51] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: "+1ed from me, let's change the jenkins config and run a" [puppet] - 10https://gerrit.wikimedia.org/r/295255 (https://phabricator.wikimedia.org/T80385) (owner: 10Hashar)
[07:39:09] <elukey>	 !log restarted hhvm on mw1139 (hhvm-dump in /tmp/hhvm.20736.bt.)
[07:39:13] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:41:42] <elukey>	 !log restarted hhvm on mw1141 - hhvm was getting SEGV (dump in /tmp/hhvm.8735.bt.)
[07:41:46] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[07:42:32] <elukey>	 something is definitely weird
[07:42:51] <elukey>	 icinga is now telling me that other two mw servers have CRITICALs
[07:44:18] <elukey>	 mw1197 and mw1230, that have ubuntu so definitely not new appservers
[07:46:10] <elukey>	 seems to be all API servers
[07:46:42] <elukey>	 and now auto-resolve
[07:46:44] <elukey>	 *resolved
[07:48:31] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[07:48:31] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production).
[07:48:42] <icinga-wm>	 RECOVERY - puppet last run on mw2084 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:52:13] <grrrit-wm>	 (03CR) 10ArielGlenn: [C: 031] Move dataset ferm rules into the role [puppet] - 10https://gerrit.wikimedia.org/r/294930 (owner: 10Muehlenhoff)
[07:52:32] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] Manage Postgresql data dir with Puppet (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel)
[07:53:26] <wikibugs>	 06Operations, 10Ops-Access-Requests, 06Parsing-Team, 06Services: Allow the Services team to administer the Parsoid cluster - https://phabricator.wikimedia.org/T137879#2382336 (10akosiaris) LGTM
[07:57:16] <grrrit-wm>	 (03PS2) 10Muehlenhoff: drac,icinga,ipmi: do not ensure => latest [puppet] - 10https://gerrit.wikimedia.org/r/295322 (https://phabricator.wikimedia.org/T115348) (owner: 10Dzahn)
[07:57:52] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032] Phabricator: remove remote 'origin' from system-wide gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/294945 (https://phabricator.wikimedia.org/T137819) (owner: 1020after4)
[07:57:57] <grrrit-wm>	 (03PS4) 10Alexandros Kosiaris: Phabricator: remove remote 'origin' from system-wide gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/294945 (https://phabricator.wikimedia.org/T137819) (owner: 1020after4)
[07:58:02] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [V: 032] Phabricator: remove remote 'origin' from system-wide gitconfig [puppet] - 10https://gerrit.wikimedia.org/r/294945 (https://phabricator.wikimedia.org/T137819) (owner: 1020after4)
[07:58:31] <grrrit-wm>	 (03CR) 10Gehel: Manage Postgresql data dir with Puppet (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel)
[08:00:51] <icinga-wm>	 PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:02:12] <icinga-wm>	 PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[08:03:41] <icinga-wm>	 PROBLEM - puppet last run on cp3007 is CRITICAL: CRITICAL: Puppet has 2 failures
[08:03:51] <icinga-wm>	 PROBLEM - Disk space on fluorine is CRITICAL: DISK CRITICAL - free space: /a 135873 MB (3% inode=99%)
[08:04:00] <icinga-wm>	 PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[08:05:30] <icinga-wm>	 PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[08:05:32] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge.
[08:05:51] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge.
[08:07:32] <icinga-wm>	 RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:07:51] <icinga-wm>	 RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:08:24] <grrrit-wm>	 (03PS1) 10Alexandros Kosiaris: postgres: Specify extra module in RSpec [puppet] - 10https://gerrit.wikimedia.org/r/295327 
[08:08:47] <wikibugs>	 06Operations, 10Gerrit, 06Release-Engineering-Team, 06WMF-Legal, and 2 others: Gerrit seemingly violates data retention guidelines - https://phabricator.wikimedia.org/T114395#2395835 (10ZhouZ)
[08:09:42] <grrrit-wm>	 (03PS3) 10Muehlenhoff: drac,icinga,ipmi: do not ensure => latest [puppet] - 10https://gerrit.wikimedia.org/r/295322 (https://phabricator.wikimedia.org/T115348) (owner: 10Dzahn)
[08:10:44] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] drac,icinga,ipmi: do not ensure => latest [puppet] - 10https://gerrit.wikimedia.org/r/295322 (https://phabricator.wikimedia.org/T115348) (owner: 10Dzahn)
[08:13:48] <grrrit-wm>	 (03CR) 10Gehel: [C: 031] postgres: Specify extra module in RSpec [puppet] - 10https://gerrit.wikimedia.org/r/295327 (owner: 10Alexandros Kosiaris)
[08:16:51] <icinga-wm>	 RECOVERY - Disk space on fluorine is OK: DISK OK
[08:17:31] <icinga-wm>	 PROBLEM - puppet last run on mw2075 is CRITICAL: CRITICAL: puppet fail
[08:17:45] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Move dataset ferm rules into the role [puppet] - 10https://gerrit.wikimedia.org/r/294930 
[08:19:41] <grrrit-wm>	 (03CR) 10Gehel: [C: 032] postgres: Specify extra module in RSpec [puppet] - 10https://gerrit.wikimedia.org/r/295327 (owner: 10Alexandros Kosiaris)
[08:21:31] <icinga-wm>	 PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[08:23:18] <grrrit-wm>	 (03PS3) 10Muehlenhoff: Move dataset ferm rules into the role [puppet] - 10https://gerrit.wikimedia.org/r/294930 
[08:23:42] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: "I find that ldaplist is really to blame here (i.e. why on earth grepping for a term on the output makes sense, while submitting that term " [puppet] - 10https://gerrit.wikimedia.org/r/295198 (https://phabricator.wikimedia.org/T122595) (owner: 10Muehlenhoff)
[08:23:50] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] Move dataset ferm rules into the role [puppet] - 10https://gerrit.wikimedia.org/r/294930 (owner: 10Muehlenhoff)
[08:24:49] <grrrit-wm>	 (03PS7) 10Gehel: Manage Postgresql data dir with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) 
[08:26:29] <grrrit-wm>	 (03CR) 10Gehel: Manage Postgresql data dir with Puppet (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel)
[08:26:31] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032] fix puppet unit test for squid3 [puppet] - 10https://gerrit.wikimedia.org/r/295130 (owner: 10Maturain)
[08:26:37] <grrrit-wm>	 (03PS2) 10Alexandros Kosiaris: fix puppet unit test for squid3 [puppet] - 10https://gerrit.wikimedia.org/r/295130 (owner: 10Maturain)
[08:26:43] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [V: 032] fix puppet unit test for squid3 [puppet] - 10https://gerrit.wikimedia.org/r/295130 (owner: 10Maturain)
[08:27:20] <icinga-wm>	 RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[08:28:01] <icinga-wm>	 RECOVERY - puppet last run on cp3007 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:32:29] <grrrit-wm>	 (03CR) 10Muehlenhoff: "Fine with me, personally I don't even see the need for ldaplist(1) to begin with. It's a reimplementation of some obscure Solaris tool and" [puppet] - 10https://gerrit.wikimedia.org/r/295198 (https://phabricator.wikimedia.org/T122595) (owner: 10Muehlenhoff)
[08:32:36] <grrrit-wm>	 (03Abandoned) 10Muehlenhoff: Bump the size limit for labs openldap server to 4096 [puppet] - 10https://gerrit.wikimedia.org/r/295198 (https://phabricator.wikimedia.org/T122595) (owner: 10Muehlenhoff)
[08:38:55] <icinga-wm>	 PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[08:39:38] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: "change looks good in premise, but pcc complains on labsdb1006" [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel)
[08:42:54] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: "I can't say I love ldaplist(1) either. I only regard it as a somewhat friendlier interface to LDAP than ldapsearch. Regardless of the tool" [puppet] - 10https://gerrit.wikimedia.org/r/295198 (https://phabricator.wikimedia.org/T122595) (owner: 10Muehlenhoff)
[08:45:47] <grrrit-wm>	 (03CR) 10Muehlenhoff: "cc @abogott who made the original request/use case in T122595" [puppet] - 10https://gerrit.wikimedia.org/r/295198 (https://phabricator.wikimedia.org/T122595) (owner: 10Muehlenhoff)
[08:46:17] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 032] otrs: add check_procs for clamd/freshclam [puppet] - 10https://gerrit.wikimedia.org/r/294939 (https://phabricator.wikimedia.org/T137188) (owner: 10Faidon Liambotis)
[08:46:23] <grrrit-wm>	 (03PS2) 10Alexandros Kosiaris: otrs: add check_procs for clamd/freshclam [puppet] - 10https://gerrit.wikimedia.org/r/294939 (https://phabricator.wikimedia.org/T137188) (owner: 10Faidon Liambotis)
[08:46:35] <icinga-wm>	 RECOVERY - puppet last run on mw2075 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:46:43] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [V: 032] otrs: add check_procs for clamd/freshclam [puppet] - 10https://gerrit.wikimedia.org/r/294939 (https://phabricator.wikimedia.org/T137188) (owner: 10Faidon Liambotis)
[08:46:47] <wikibugs>	 06Operations, 10DBA: m1-master switch from db1001 to db1016 - https://phabricator.wikimedia.org/T106312#2395891 (10fgiunchedi) >>! In T106312#2395761, @akosiaris wrote: >>>! In T106312#2394393, @jcrespo wrote: >> @Akosiaris @MoritzMuehlenhoff @fgiunchedi @demon @Krenair @Dzahn I intend to perform the failover...
[08:47:32] <wikibugs>	 06Operations, 10DBA: m1-master switch from db1001 to db1016 - https://phabricator.wikimedia.org/T106312#2395893 (10MoritzMuehlenhoff) 17 UTC also fine with me
[08:48:22] <grrrit-wm>	 (03PS2) 10Muehlenhoff: package_builder: Add gobject-introspection to package list [puppet] - 10https://gerrit.wikimedia.org/r/295226 
[08:48:52] <grrrit-wm>	 (03CR) 10Muehlenhoff: [C: 032 V: 032] package_builder: Add gobject-introspection to package list [puppet] - 10https://gerrit.wikimedia.org/r/295226 (owner: 10Muehlenhoff)
[08:49:25] <icinga-wm>	 PROBLEM - puppet last run on mw2123 is CRITICAL: CRITICAL: puppet fail
[08:58:14] <grrrit-wm>	 (03PS8) 10Gehel: Manage Postgresql data dir with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) 
[09:00:01] <wikibugs>	 06Operations, 10DBA: m1-master switch from db1001 to db1016 - https://phabricator.wikimedia.org/T106312#2395900 (10jcrespo) Let's go with [[ https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160622T1500 | 17 UTC ]] instead
[09:02:54] <grrrit-wm>	 (03CR) 10Gehel: "Puppet compiler now looks good https://puppet-compiler.wmflabs.org/3152/" [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel)
[09:12:17] <grrrit-wm>	 (03CR) 10DCausse: "I agree with Erik, the regexes are too complex imo and really hard to determine if a node is considered or not." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294918 (owner: 10Gehel)
[09:12:19] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 031] Manage Postgresql data dir with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel)
[09:15:03] <grrrit-wm>	 (03CR) 10Gehel: "Puppet compiler now fails on restbase1001.eqiad.wmnet. But the production catalogue also fails, so the issue is probably outside of this c" [puppet] - 10https://gerrit.wikimedia.org/r/295123 (https://phabricator.wikimedia.org/T137422) (owner: 10Nicko)
[09:18:34] <icinga-wm>	 RECOVERY - puppet last run on mw2123 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[09:21:53] <grrrit-wm>	 (03PS1) 10Jcrespo: Depool db1068; repool db1070 as api [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295328 
[09:22:39] <moritzm>	 !log rolling reboot of logstash cluster to Linux 4.4
[09:22:43] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:23:16] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Depool db1068; repool db1070 as api [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295328 (owner: 10Jcrespo)
[09:25:06] <logmsgbot>	 !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1068; repool db1070 and db1071 as api (duration: 00m 27s)
[09:25:10] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:27:14] <icinga-wm>	 PROBLEM - BGP status on cr2-ulsfo is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active, AS2914/IPv6: Active
[09:27:44] <icinga-wm>	 PROBLEM - NTP on labsdb1008 is CRITICAL: NTP CRITICAL: Offset unknown
[09:28:59] <jynus>	 mmm
[09:29:51] <jynus>	 I was worring that meant network/load/down problems, but it does not
[09:31:42] <grrrit-wm>	 (03PS3) 10Gehel: Configuration for new elasticsearch servers in eqiad. [puppet] - 10https://gerrit.wikimedia.org/r/294918 
[09:34:14] <grrrit-wm>	 (03PS9) 10Gehel: Manage Postgresql data dir with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) 
[09:35:40] <grrrit-wm>	 (03CR) 10Gehel: [C: 032] Manage Postgresql data dir with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295227 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel)
[09:39:38] <icinga-wm>	 RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[09:40:49] <grrrit-wm>	 (03CR) 10DCausse: [C: 031] Configuration for new elasticsearch servers in eqiad. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294918 (owner: 10Gehel)
[09:43:24] <gehel>	 !log lowering disk high watermark to rebalance elasticsearch eqiad cluster disk space
[09:43:28] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[09:46:08] <icinga-wm>	 RECOVERY - NTP on labsdb1008 is OK: NTP OK: Offset -0.00455057621 secs
[09:51:29] <wikibugs>	 06Operations, 10DBA: Decomission old coredb machines (<=db1050) - https://phabricator.wikimedia.org/T134476#2395956 (10jcrespo) a:03jcrespo
[09:52:35] <wikibugs>	 06Operations, 10DBA: Upgrade db1022, which has an older kernel - https://phabricator.wikimedia.org/T101516#2395963 (10jcrespo)
[09:53:16] <grrrit-wm>	 (03CR) 10Nicko: Include a cassandra::instance::monitoring class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/295123 (https://phabricator.wikimedia.org/T137422) (owner: 10Nicko)
[09:54:02] <wikibugs>	 06Operations, 10Gerrit, 06Release-Engineering-Team, 06WMF-Legal, and 2 others: Gerrit seemingly violates data retention guidelines - https://phabricator.wikimedia.org/T114395#1694145 (10hashar) Gerrit runs on ytterbium and Apache2 has a logrotate rule: ``` $ cat /etc/logrotate.d/apache2  /var/log/apache2/*...
[09:54:06] <wikibugs>	 06Operations, 10Gerrit, 06Release-Engineering-Team, 06WMF-Legal, and 2 others: Gerrit seemingly violates data retention guidelines - https://phabricator.wikimedia.org/T114395#2395969 (10hashar) p:05High>03Normal
[10:03:09] <moritzm>	 !log installing wget security updates on Ubuntu systems
[10:03:13] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:11:06] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1274 is CRITICAL: Host mw1274 is not in mediawiki-installation dsh group
[10:12:07] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1283 is CRITICAL: Host mw1283 is not in mediawiki-installation dsh group
[10:12:56] <icinga-wm>	 RECOVERY - puppet last run on mw1282 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[10:13:28] <elukey>	 the above ones are mine, new appservers, but they were silenced
[10:13:51] <elukey>	 mmm apparently expired
[10:13:55] <elukey>	 re-set downtime
[10:19:02] <wikibugs>	 06Operations, 10vm-requests: eqiad/codfw: 4 VM request for prometheus - https://phabricator.wikimedia.org/T136313#2395997 (10fgiunchedi) p:05Triage>03Normal a:05Dzahn>03fgiunchedi
[10:22:21] <moritzm>	 !log installing expat security updates on Ubuntu systems
[10:22:24] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:24:18] <icinga-wm>	 RECOVERY - puppet last run on ms-be2012 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:25:15] <wikibugs>	 06Operations, 10ops-codfw: ms-be2012.codfw.wmnet: slot=10 dev=sdk failed - https://phabricator.wikimedia.org/T135975#2396006 (10fgiunchedi) 05Open>03Resolved disk rebuilding
[10:25:17] <grrrit-wm>	 (03CR) 10Hashar: "I have tried the migration on my local machine and commented on T80385. In short:" [puppet] - 10https://gerrit.wikimedia.org/r/295255 (https://phabricator.wikimedia.org/T80385) (owner: 10Hashar)
[10:27:20] <grrrit-wm>	 (03PS1) 10Ema: tlsproxy: enable client/server TFO support in the kernel [puppet] - 10https://gerrit.wikimedia.org/r/295331 (https://phabricator.wikimedia.org/T108827) 
[10:32:05] <grrrit-wm>	 (03PS2) 10Ema: tlsproxy: enable client/server TFO support in the kernel [puppet] - 10https://gerrit.wikimedia.org/r/295331 (https://phabricator.wikimedia.org/T108827) 
[10:32:09] <godog>	 !log reboot ms-be2003 for disk ordering - T137785
[10:32:10] <stashbot>	 T137785: ms-be2003.codfw.wmnet: slot=4 dev=sde failed - https://phabricator.wikimedia.org/T137785
[10:32:12] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[10:40:28] <grrrit-wm>	 (03CR) 10Hashar: "I guess we will want:" [puppet] - 10https://gerrit.wikimedia.org/r/295255 (https://phabricator.wikimedia.org/T80385) (owner: 10Hashar)
[10:42:16] <icinga-wm>	 RECOVERY - puppet last run on ms-be2003 is OK: OK: Puppet is currently enabled, last run 8 seconds ago with 0 failures
[10:43:15] <wikibugs>	 06Operations, 10ops-codfw: ms-be2003.codfw.wmnet: slot=4 dev=sde failed - https://phabricator.wikimedia.org/T137785#2396028 (10fgiunchedi) 05Open>03Resolved disk rebuilding
[10:44:06] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: network: add $production_networks (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis)
[10:46:56] <grrrit-wm>	 (03PS31) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis)
[10:56:49] <grrrit-wm>	 (03CR) 10Muehlenhoff: tlsproxy: enable client/server TFO support in the kernel (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/295331 (https://phabricator.wikimedia.org/T108827) (owner: 10Ema)
[11:01:56] <icinga-wm>	 RECOVERY - BGP status on cr2-ulsfo is OK: BGP OK - up: 77, down: 2, shutdown: 0
[11:05:54] <wikibugs>	 06Operations, 10Traffic, 13Patch-For-Review: Investigate TCP Fast Open for tlsproxy - https://phabricator.wikimedia.org/T108827#2396050 (10ema) So, here are a few findings so far.  tshark can be used to detect SYN packets with a TFO cookie request:    tshark -f 'tcp[tcpflags] & tcp-syn != 0' -Y 'tcp.options....
[11:06:01] <jynus>	 !log reimaging db1068
[11:06:05] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[11:10:27] <grrrit-wm>	 (03PS1) 10Alexandros Kosiaris: ferm: Kill INTERNAL_V4/INTERNAL_V6 definitions [puppet] - 10https://gerrit.wikimedia.org/r/295332 
[11:10:29] <grrrit-wm>	 (03PS1) 10Alexandros Kosiaris: ferm: Populate INTERNAL from network::constants [puppet] - 10https://gerrit.wikimedia.org/r/295333 
[11:15:20] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: network: add $production_networks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis)
[11:15:29] <akosiaris>	 paravoid: ^
[11:18:06] <paravoid>	 this will probably break Labs instances
[11:18:54] <paravoid>	 hm, or not? would sphere => private include labs instances networks?
[11:18:56] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://citoid.svc.codfw.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.codfw.wmnet:1970/api
[11:20:56] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[11:22:34] <paravoid>	 what was that?
[11:22:43] <yurik_>	 anomie, are you doing morning swat today?
[11:25:27] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:26:19] <paravoid>	 akosiaris, mobrovac?
[11:27:35] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[11:28:06] <jynus>	 in the past, temporary issues were caused by dependency on url_downloader
[11:28:15] <jynus>	 on codfw
[11:28:22] <jynus>	 let me discard that
[11:29:16] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:29:36] * akosiaris looking
[11:29:47] <akosiaris>	 never got the page btw
[11:30:15] <icinga-wm>	 PROBLEM - check_mysql on fdb2001 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 2198
[11:31:50] <paravoid>	 me neither
[11:32:09] <akosiaris>	 Unable to locate resource with pmcid PMC9999999
[11:32:21] <akosiaris>	 guess which governmental database had problems again
[11:32:54] <akosiaris>	 jynus: did you get the page ?
[11:32:58] <jynus>	 nope
[11:33:02] <akosiaris>	 hmm
[11:33:28] <jynus>	 alsafi is up, BTW
[11:33:51] <akosiaris>	 jynus: yeah, had nothing to do with url_downloader this time around
[11:34:10] <akosiaris>	 it's the gov database that citoid uses to make use PCMIDs are ok
[11:34:15] <akosiaris>	 valid or something
[11:35:12] <icinga-wm>	 RECOVERY - check_mysql on fdb2001 is OK: Uptime: 652776 Threads: 1 Questions: 6620000 Slow queries: 4145 Opens: 713 Flush tables: 2 Open tables: 574 Queries per second avg: 10.141 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 0
[11:35:23] <akosiaris>	 paravoid: break labs instances as as to what ?
[11:35:32] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[11:35:58] <akosiaris>	 I 've varied on $::realm on purpose to avoid exactly that
[11:36:27] <akosiaris>	 the one thing that might break indeed is the labs support hosts
[11:36:32] <jynus>	 akosiaris, are you saying that an external resource breaks the app, or just the check is sensible to that?
[11:37:08] <grrrit-wm>	 (03PS1) 10KartikMistry: Deploy Compact Language Links as default (Stage 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295334 (https://phabricator.wikimedia.org/T136677) 
[11:37:29] <akosiaris>	 jynus: the service relies on an external resource to validate a given PMCID is valid. It submits a request to http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
[11:37:41] <akosiaris>	 the check, being swagger based, check all endpoints
[11:37:54] <akosiaris>	 and submits on purpose an invalid check for a PMCID of 99999999
[11:38:03] <akosiaris>	 plus/minus a few 9s
[11:38:30] <akosiaris>	 but when that external resource is unresponsive, the check fails and we get that
[11:38:56] <jynus>	 so it is the check that fails, not the whole application
[11:39:08] <jynus>	 1 heuristic check
[11:39:23] <jynus>	 that is ok
[11:39:55] <jynus>	 is like the "San Francisco" header, being heuristic is ok
[11:40:06] <akosiaris>	 well, it should not be alerting
[11:40:13] <jynus>	 yes
[11:40:27] <jynus>	 or if it is, it should be clear why
[11:40:38] <paravoid>	 can we not page when a random third-party service fails please? :)
[11:40:50] <akosiaris>	 and ofc there is the big question on why we are relying our monitoring on an external database
[11:41:15] <paravoid>	 alerts should be actionable
[11:41:25] <akosiaris>	 the answer usually is "we can not know if a PMCID is invalid unless someone else tells us so" but I 've never understood why that has to be monitored 
[11:41:33] <jynus>	 I would be more worried about pinging every 5 minutes an external resource
[11:41:47] <akosiaris>	 jynus: that is what we are effectively doing
[11:41:59] <jynus>	 akosiaris, that sems like an application-level check
[11:42:06] <jynus>	 ok to do on "user space"
[11:42:16] <jynus>	 not on "infrastructure space"
[11:42:21] <jynus>	 if that means something
[11:42:27] <akosiaris>	 yeah, I don't follow
[11:42:50] <akosiaris>	 remember that our checks are integrated to the service due to monitoring everything the swagger spec advertises
[11:43:04] <jynus>	 yes
[11:43:04] <akosiaris>	 which in premise is what we want
[11:43:14] <akosiaris>	 making sure all advertised endpoints actually work
[11:43:31] <akosiaris>	 but honestly, monitoring that endpoint, obviously makes no sense
[11:43:35] <jynus>	 but maybe everything should not be advertised
[11:43:49] <jynus>	 only the minimum things to say "this is up"
[11:43:51] <akosiaris>	 well, it's a bit more complex than that
[11:43:56] <jynus>	 I know
[11:44:01] <akosiaris>	 no, the premise is the other way around
[11:44:06] <jynus>	 I am not complaining to you
[11:44:08] <akosiaris>	 do everything
[11:44:19] <akosiaris>	 otherwise you have errors you don't ever catch
[11:44:23] <akosiaris>	 and downtimes you never catch
[11:44:48] <akosiaris>	 likes ORES being down 8 hours the other day and noone noticing because it responsed ok to our monitoring
[11:44:57] <akosiaris>	 but it would return 503 for a number of endpoints
[11:45:09] <akosiaris>	 ORES is in the process of fixing the swagger spec btw
[11:45:19] <akosiaris>	 so soon that should not be possible to happen again
[11:45:22] <akosiaris>	 but I digress
[11:46:05] <akosiaris>	 so, I think we should have a way of informing our service_checker that a specific endpoint should not be monitored
[11:46:42] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:46:53] <icinga-wm>	 PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:47:13] <icinga-wm>	 PROBLEM - citoid endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:47:43] <icinga-wm>	 PROBLEM - citoid endpoints health on scb1001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[11:48:04] <jynus>	 so, if a service is up (processes are running, hardware is healthy, protocol responds), an a programming error prevents a very specific code to run, why should I be paged?
[11:48:17] <akosiaris>	 same thing btw ^
[11:48:26] <akosiaris>	 curl 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pmc&retmode=xml&id=9999999'
[11:48:30] <akosiaris>	 never returns...
[11:48:41] <jynus>	 the other day I got called up for trying to add preciselly those kind of checks to icinga
[11:48:52] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://citoid.svc.codfw.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.codfw.wmnet:1970/api
[11:48:52] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad PMCID) is CRITICAL: Could not fetch url http://citoid.svc.eqiad.wmnet:1970/api: Timeout on connection while downloading http://citoid.svc.eqiad.wmnet:1970/api
[11:48:59] <jynus>	 *out
[11:49:22] <icinga-wm>	 RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[11:50:06] <akosiaris>	 jynus: you are opening a long conversation here, but on premise yes I agree with you, I don't think we should be paged for something like that. 
[11:50:31] <akosiaris>	 the actual implementation details ofc don't exist
[11:50:34] <akosiaris>	 but you are right
[11:50:44] <jynus>	 actually, my position is that it should page, and more checks are good, but to the right person
[11:51:06] <jynus>	 or with the right protocols
[11:51:56] <akosiaris>	 I was commenting on the "why should I be paged?"
[11:52:05] <akosiaris>	 specifically the "I" part
[11:52:10] <akosiaris>	 where I == ops for me
[11:52:20] <akosiaris>	 somebody should be paged ofc
[11:52:45] <akosiaris>	 speaking of which, I 've got no SMS yet
[11:53:05] <akosiaris>	 stalled ? never sent ? 
[11:53:12] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[11:53:12] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[11:53:22] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[11:53:26] <jynus>	 yesterday lots of sms were sent
[11:53:32] <icinga-wm>	 RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[11:53:36] <akosiaris>	 yup. I 've received those
[11:53:38] <jynus>	 do you want me to provoke a page?
[11:53:44] <akosiaris>	 lol
[11:53:52] <akosiaris>	 nope, I can do that on my own :-)
[11:53:53] <jynus>	 I'm serious
[11:54:10] <mark>	 very serious
[11:54:14] <icinga-wm>	 RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[11:54:16] <akosiaris>	 I actually rely on pfw on codfw to do that anytime now
[11:56:31] <akosiaris>	 ah,
[11:56:39] <akosiaris>	 so we were not meant to be paged 
[11:56:54] <jynus>	 which is ok
[11:56:56] <akosiaris>	 the failing LVS check is the one _joe_ introduced the other day that relies on service_checker
[11:57:12] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[11:57:13] <akosiaris>	 which is good
[11:57:38] <akosiaris>	 _joe_: :)
[11:58:13] <gehel>	 ^ checking elasticsearch...
[12:11:02] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[12:16:36] <paravoid>	 akosiaris: so what is "sphere => private" supposed to cover?
[12:16:58] <paravoid>	 is it what it's now? half of production + all of Labs?
[12:25:29] <gehel>	 !log lowering throttling limit for index recovery on eqiad elasticsearch cluster
[12:25:33] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:30:39] <hashar>	 !log T136973 started cut of branch wmf/1.28.0-wmf.7
[12:30:40] <stashbot>	 T136973: MW-1.28.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T136973
[12:30:43] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:33:21] <hoo>	 !log Removed Wikidata json dumps from 20160620 (inconsistent, per T138291).
[12:33:22] <stashbot>	 T138291: Latest wikidata JSON dump contains unexpected sql warning - https://phabricator.wikimedia.org/T138291
[12:33:25] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:35:52] <gehel>	 !log lowering throttling limit for index recovery on codfw elasticsearch cluster
[12:35:55] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:36:29] <hoo>	 !log Started a new JSON dump creation on snapshot1003 (after the last one was inconsistent, per T138291)
[12:36:30] <stashbot>	 T138291: Latest wikidata JSON dump contains unexpected sql warning - https://phabricator.wikimedia.org/T138291
[12:36:34] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[12:49:17] <RoanKattouw>	 !log Running extensions/Echo/maintenance/backfillReadBundles.php on all Echo-enabled wikis
[12:49:20] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Mr. Obvious
[12:49:35] <RoanKattouw>	 !log Running extensions/Echo/maintenance/backfillReadBundles.php on all Echo-enabled wikis for T136368
[12:49:36] <stashbot>	 T136368: Dynamic bundle: non-bundle_base notifications need a read timestamp - https://phabricator.wikimedia.org/T136368
[12:49:39] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Mr. Obvious
[12:55:03] <icinga-wm>	 PROBLEM - puppet last run on cp3046 is CRITICAL: CRITICAL: puppet fail
[12:57:23] <icinga-wm>	 RECOVERY - puppet last run on cp3046 is OK: OK: Puppet is currently enabled, last run 41 seconds ago with 0 failures
[12:57:49] <moritzm>	 !log rolling restart of hhvm/apache in codfw for expat security update
[12:57:53] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:11:28] <RoanKattouw>	 !log Running extensions/Echo/maintenance/removeOrphanedEvents.php on all Echo-enabled wikis for T136425
[13:11:29] <stashbot>	 T136425: Remove orphaned echo_event rows - https://phabricator.wikimedia.org/T136425
[13:11:32] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Mr. Obvious
[13:12:42] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1286 is CRITICAL: Host mw1286 is not in mediawiki-installation dsh group
[13:12:42] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1285 is CRITICAL: Host mw1285 is not in mediawiki-installation dsh group
[13:13:12] <elukey>	 --^ these are mine
[13:15:09] <hashar>	 !log T136973 applied all security patches to 1.28.0-wmf.7
[13:15:10] <stashbot>	 T136973: MW-1.28.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T136973
[13:15:13] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:16:58] <grrrit-wm>	 (03PS1) 10Hashar: Group0 to 1.28.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295339 (https://phabricator.wikimedia.org/T136973) 
[13:18:12] <hashar>	 I am willing to push 1.28.0-wmf.7 to testwiki  soonish  but apparently there are a few things going on 
[13:18:15] <hashar>	 so I will hold a bit :)
[13:27:40] <grrrit-wm>	 (03PS1) 10Jcrespo: Repool db1068 with low weight; depool db1061 and db1062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295341 
[13:28:45] <grrrit-wm>	 (03CR) 10Jcrespo: [C: 032] Repool db1068 with low weight; depool db1061 and db1062 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295341 (owner: 10Jcrespo)
[13:35:21] <icinga-wm>	 PROBLEM - Apache HTTP on mw1284 is CRITICAL: Connection refused
[13:36:32] <icinga-wm>	 PROBLEM - puppet last run on mw1284 is CRITICAL: Connection refused by host
[13:37:01] <icinga-wm>	 PROBLEM - salt-minion processes on mw1284 is CRITICAL: Connection refused by host
[13:37:46] <grrrit-wm>	 (03CR) 10Joal: "@Elukey: Given the talk we had and the fact that Brandon asked you to log everything, maybe this filter should be removed?" [puppet] - 10https://gerrit.wikimedia.org/r/294455 (https://phabricator.wikimedia.org/T136314) (owner: 10Elukey)
[13:37:51] <icinga-wm>	 PROBLEM - Check size of conntrack table on mw1284 is CRITICAL: Connection refused by host
[13:38:10] <icinga-wm>	 PROBLEM - DPKG on mw1284 is CRITICAL: Connection refused by host
[13:38:30] <icinga-wm>	 PROBLEM - Disk space on mw1284 is CRITICAL: Connection refused by host
[13:38:50] <icinga-wm>	 PROBLEM - MD RAID on mw1284 is CRITICAL: Connection refused by host
[13:39:41] <icinga-wm>	 PROBLEM - configured eth on mw1284 is CRITICAL: Connection refused by host
[13:40:01] <icinga-wm>	 PROBLEM - dhclient process on mw1284 is CRITICAL: Connection refused by host
[13:40:21] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw1284 is CRITICAL: Host mw1284 is not in mediawiki-installation dsh group
[13:40:41] <icinga-wm>	 PROBLEM - nutcracker port on mw1284 is CRITICAL: Connection refused by host
[13:41:00] <icinga-wm>	 PROBLEM - nutcracker process on mw1284 is CRITICAL: Connection refused by host
[13:41:09] <elukey>	 this is me! new appserver
[13:41:16] <elukey>	 didn't see it on icinga till now
[13:41:34] <elukey>	 silencing
[13:41:38] <apergos>	 ah I was aobut to ask if that was mori tz
[13:41:41] <apergos>	 thanks
[13:42:02] <moritzm>	 nope, I haven't done anything wrt mw1* servers yet
[13:42:42] <wikibugs>	 06Operations, 06Discovery, 06Maps, 03Maps-Sprint, 13Patch-For-Review: Configure new maps servers in eqiad - https://phabricator.wikimedia.org/T138092#2388933 (10Gehel)
[13:43:04] <wikibugs>	 06Operations, 06Discovery, 06Maps, 03Maps-Sprint, 13Patch-For-Review: Configure new maps servers in eqiad - https://phabricator.wikimedia.org/T138092#2388933 (10Gehel) a:03Gehel
[13:43:31] <icinga-wm>	 PROBLEM - puppet last run on mw2192 is CRITICAL: CRITICAL: puppet fail
[13:44:24] <wikibugs>	 06Operations, 06Discovery, 06Maps, 07Epic: Epic: cultivating the Maps garden - https://phabricator.wikimedia.org/T137616#2396438 (10Gehel)
[13:44:26] <wikibugs>	 06Operations, 06Discovery, 06Maps, 03Maps-Sprint, 13Patch-For-Review: Rack/Setup 4 map servers in eqiad - https://phabricator.wikimedia.org/T135018#2396437 (10Gehel) 05Open>03Resolved
[13:46:23] <grrrit-wm>	 (03PS1) 10Gehel: Postgresql: init database with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295343 (https://phabricator.wikimedia.org/T138092) 
[13:47:44] <hashar>	 going to live hack mediawiki-config to push  1.28.0-wmf.7 to testwiki
[13:48:22] <grrrit-wm>	 (03PS2) 10Filippo Giunchedi: svc: add graphite LVS addresses [dns] - 10https://gerrit.wikimedia.org/r/289635 (https://phabricator.wikimedia.org/T85451) 
[13:48:24] <grrrit-wm>	 (03PS1) 10Filippo Giunchedi: add prometheus VMs in eqiad/codfw [dns] - 10https://gerrit.wikimedia.org/r/295344 (https://phabricator.wikimedia.org/T136313) 
[13:48:51] <logmsgbot>	 !log hashar@tin Started scap: testwiki to 1.28.0-wmf.7  T136973
[13:48:52] <stashbot>	 T136973: MW-1.28.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T136973
[13:48:55] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:49:04] <wikibugs>	 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#2396500 (10chasemp) p:05Triage>03Normal one thought is we have an influx of new labsdb things coming I believe. This way sort itself out w/o a lot of in-place shuffling.
[13:49:21] <icinga-wm>	 PROBLEM - Unmerged changes on repository mediawiki_config on tin is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master).
[13:49:32] <icinga-wm>	 PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master).
[13:49:40] <hashar>	 ^^both are me I believe
[13:51:30] <grrrit-wm>	 (03PS2) 10Filippo Giunchedi: add prometheus VMs in eqiad/codfw [dns] - 10https://gerrit.wikimedia.org/r/295344 (https://phabricator.wikimedia.org/T136313) 
[13:52:18] <jynus>	 it could be me
[13:52:26] <grrrit-wm>	 (03CR) 10Gehel: "Puppet compiler is not telling much, but at least it compiles cleanly: https://puppet-compiler.wmflabs.org/3154/" [puppet] - 10https://gerrit.wikimedia.org/r/295343 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel)
[13:52:29] <grrrit-wm>	 (03CR) 10Hashar: "Will do the renaming tomorrow (Wednesday) during European morning." [puppet] - 10https://gerrit.wikimedia.org/r/295255 (https://phabricator.wikimedia.org/T80385) (owner: 10Hashar)
[13:52:52] <hashar>	 jynus: oh there is a db pooling change that is left undeployed 
[13:52:59] <jynus>	 yes, I am with that
[13:53:02] <hashar>	 jynus: is it safe to have it synced?  I am running scap right now
[13:53:05] <jynus>	 takes some time
[13:53:09] <logmsgbot>	 !log hashar@tin scap aborted: testwiki to 1.28.0-wmf.7  T136973 (duration: 04m 17s)
[13:53:10] <stashbot>	 T136973: MW-1.28.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T136973
[13:53:12] <jynus>	 yes
[13:53:13] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:53:25] <jynus>	 almost 99% of my changes are idempotent
[13:53:45] <logmsgbot>	 !log hashar@tin Started scap: testwiki to 1.28.0-wmf.7 (take two) T136973
[13:53:45] <stashbot>	 T136973: MW-1.28.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T136973
[13:53:48] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:54:00] <jynus>	 well, not idempotent, but I mean they can be partially deployed/fail etc without problems
[13:54:01] <hashar>	 I have cancelled to fast :D
[13:54:27] <jynus>	 should I deploy then?
[13:54:39] <hashar>	 I guess the scap run I am handling will do it
[13:54:56] <hashar>	 it is busy rebuilding the l10ncache
[13:55:05] <jynus>	 stagging is ok right now
[13:55:20] <logmsgbot>	 !log hashar@tin scap aborted: testwiki to 1.28.0-wmf.7 (take two) T136973 (duration: 01m 35s)
[13:55:21] <stashbot>	 T136973: MW-1.28.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T136973
[13:55:24] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:55:24] <hashar>	 ah wrong window
[13:55:25] <hashar>	 ...
[13:55:35] <logmsgbot>	 !log hashar@tin Started scap: testwiki to 1.28.0-wmf.7 (take three) T136973
[13:55:35] <stashbot>	 T136973: MW-1.28.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T136973
[13:55:39] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[13:56:02] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra CQL 10.64.0.79:9042 on maps1001 is CRITICAL: Connection refused Gehel configuration in progress
[13:56:03] <icinga-wm>	 ACKNOWLEDGEMENT - cassandra service on maps1001 is CRITICAL: NRPE: Command check_cassandra-state not defined Gehel configuration in progress
[13:56:03] <icinga-wm>	 ACKNOWLEDGEMENT - kartotherian endpoints health on maps1001 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.0.79, port=6533): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) Gehel configuration in progress
[13:56:04] <icinga-wm>	 ACKNOWLEDGEMENT - puppet last run on maps1001 is CRITICAL: CRITICAL: Puppet has 7 failures Gehel configuration in progress
[13:56:04] <icinga-wm>	 ACKNOWLEDGEMENT - tilerator on maps1001 is CRITICAL: Connection refused Gehel configuration in progress
[13:56:05] <icinga-wm>	 ACKNOWLEDGEMENT - tileratorui on maps1001 is CRITICAL: Connection refused Gehel configuration in progress
[13:57:32] <gehel>	 Sorry for the spam, I did not think that acknowledging alerts on a host with scheduled downtime would generate noise...
[13:58:27] <jynus>	 it does, if marked "Send notification". disable notifications != create downtime period
[13:59:03] <jynus>	 e.g. if something alerts, then you downtime, then it goes up, it will notify
[13:59:47] <grrrit-wm>	 (03CR) 10Muehlenhoff: Restart exim daily on Monday to Friday (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294929 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff)
[13:59:49] <gehel>	 jynus: thanks! I'll check that right now...
[14:00:07] <grrrit-wm>	 (03PS2) 10Muehlenhoff: Restart exim daily on Monday to Friday [puppet] - 10https://gerrit.wikimedia.org/r/294929 (https://phabricator.wikimedia.org/T135991) 
[14:00:46] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 032 V: 032] add prometheus VMs in eqiad/codfw [dns] - 10https://gerrit.wikimedia.org/r/295344 (https://phabricator.wikimedia.org/T136313) (owner: 10Filippo Giunchedi)
[14:02:12] <logmsgbot>	 !log hashar@tin scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="labtestwiki" --outdir="/tmp/scap_l10n_2087727834" --threads=4 --lang en  --quiet' returned non-zero exit status 255 (duration: 06m 37s)
[14:02:16] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:03:06] <gehel>	 !log disabling alerting for maps100?\.eqiad\.wmnet during initial installation
[14:03:11] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:05:48] <icinga-wm>	 PROBLEM - Disk space on mw2243 is CRITICAL: Connection refused by host
[14:06:18] <icinga-wm>	 PROBLEM - MD RAID on mw2243 is CRITICAL: Timeout while attempting connection
[14:06:23] <logmsgbot>	 !log hashar@tin Started scap: (no message)
[14:06:28] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:07:08] <icinga-wm>	 PROBLEM - Apache HTTP on mw2243 is CRITICAL: Connection timed out
[14:07:17] <icinga-wm>	 PROBLEM - configured eth on mw2243 is CRITICAL: Timeout while attempting connection
[14:07:38] <icinga-wm>	 PROBLEM - dhclient process on mw2243 is CRITICAL: Timeout while attempting connection
[14:07:48] <icinga-wm>	 PROBLEM - mediawiki-installation DSH group on mw2243 is CRITICAL: Host mw2243 is not in mediawiki-installation dsh group
[14:08:08] <icinga-wm>	 PROBLEM - nutcracker port on mw2243 is CRITICAL: Timeout while attempting connection
[14:08:29] <icinga-wm>	 PROBLEM - nutcracker process on mw2243 is CRITICAL: Timeout while attempting connection
[14:08:47] <icinga-wm>	 PROBLEM - puppet last run on mw2243 is CRITICAL: Timeout while attempting connection
[14:08:49] <hashar>	 have to quickly rush to school. be back in 6 minutes
[14:08:58] <hashar>	 wikiversions.json is live hacked to push .7 to testwiki
[14:09:01] <hashar>	 and scap going on
[14:09:08] <icinga-wm>	 PROBLEM - salt-minion processes on mw2243 is CRITICAL: Timeout while attempting connection
[14:09:22] <logmsgbot>	 !log hashar@tin scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki="labtestwiki" --outdir="/tmp/scap_l10n_87423667" --threads=4 --lang en  --quiet' returned non-zero exit status 255 (duration: 02m 58s)
[14:09:27] <icinga-wm>	 RECOVERY - puppet last run on mw2192 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:09:33] <elukey>	 2243 is mine
[14:09:37] <elukey>	 new appserver
[14:09:39] <icinga-wm>	 PROBLEM - Check size of conntrack table on mw2243 is CRITICAL: Timeout while attempting connection
[14:09:46] <elukey>	 wow I started the install this morning :/
[14:09:57] <icinga-wm>	 PROBLEM - DPKG on mw2243 is CRITICAL: Timeout while attempting connection
[14:12:57] <moritzm>	 !log depooling restbase1007 for upgrade to Linux 4.4
[14:13:01] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:14:22] <moritzm>	 !log correction: restbase1007 was already depooled for cassandra maintenance, thus only rebooting to 4.4
[14:14:26] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:14:53] <hashar>	 back
[14:17:39] <grrrit-wm>	 (03PS4) 10Gehel: Configuration for new elasticsearch servers in eqiad. [puppet] - 10https://gerrit.wikimedia.org/r/294918 
[14:21:12] <grrrit-wm>	 (03CR) 10EBernhardson: [C: 031] "other than david's comment looks sane to me" [puppet] - 10https://gerrit.wikimedia.org/r/294918 (owner: 10Gehel)
[14:22:01] <grrrit-wm>	 (03CR) 10DCausse: [C: 031] Configuration for new elasticsearch servers in eqiad. [puppet] - 10https://gerrit.wikimedia.org/r/294918 (owner: 10Gehel)
[14:22:12] <grrrit-wm>	 (03CR) 10Gehel: Configuration for new elasticsearch servers in eqiad. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/294918 (owner: 10Gehel)
[14:25:17] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp main page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp main page via mobile-sections-lead returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/mobile-sections-remaining/{title} (retrieve remaining sections of en.wp main page via mobile-s
[14:25:57] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp main page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp main page via mobile-sections-lead returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/mobile-sections-remaining/{title} (retrieve remaining sections of en.wp main page via mobile-sections-rema
[14:26:18] <icinga-wm>	 PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp main page via mobile-sections-lead) is CRITICAL: Test retrieve lead section of en.wp main page via mobile-sections-lead returned the unexpected status 500 (expecting: 200): /{domain}/v1/page/mobile-sections-lead/{title} (retrieve lead section of en.wp Barack Obama page via mobile-sections-lead) i
[14:28:05] <godog>	 I'm taking a look at mobileapps
[14:28:28] <logmsgbot>	 !log hashar@tin Started scap: testwiki to group0 (previously was labtestwiki which does not work)
[14:28:32] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:31:07] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy
[14:32:08] <icinga-wm>	 PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[14:32:18] <godog>	 ah I'm assuming that's related to restarting restbase1007 (mobileapps 500s)
[14:32:27] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy
[14:32:39] <icinga-wm>	 RECOVERY - Apache HTTP on mw1284 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.002 second response time
[14:33:08] <icinga-wm>	 RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[14:34:37] <icinga-wm>	 RECOVERY - Apache HTTP on mw2243 is OK: HTTP OK: HTTP/1.1 200 OK - 11378 bytes in 0.074 second response time
[14:35:37] <icinga-wm>	 RECOVERY - Disk space on mw2243 is OK: DISK OK
[14:35:39] <icinga-wm>	 RECOVERY - nutcracker port on mw2243 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[14:35:58] <icinga-wm>	 RECOVERY - nutcracker process on mw2243 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[14:35:59] <icinga-wm>	 RECOVERY - nutcracker process on mw1284 is OK: PROCS OK: 1 process with UID = 110 (nutcracker), command name nutcracker
[14:35:59] <icinga-wm>	 RECOVERY - MD RAID on mw2243 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[14:36:07] <icinga-wm>	 RECOVERY - Check size of conntrack table on mw1284 is OK: OK: nf_conntrack is 0 % full
[14:36:09] <icinga-wm>	 RECOVERY - Disk space on mw1284 is OK: DISK OK
[14:36:37] <icinga-wm>	 RECOVERY - MD RAID on mw1284 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[14:36:37] <icinga-wm>	 RECOVERY - salt-minion processes on mw2243 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:36:48] <icinga-wm>	 RECOVERY - salt-minion processes on mw1284 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[14:36:58] <icinga-wm>	 RECOVERY - configured eth on mw2243 is OK: OK - interfaces up
[14:37:17] <icinga-wm>	 RECOVERY - Check size of conntrack table on mw2243 is OK: OK: nf_conntrack is 0 % full
[14:37:27] <icinga-wm>	 RECOVERY - configured eth on mw1284 is OK: OK - interfaces up
[14:37:28] <icinga-wm>	 RECOVERY - dhclient process on mw2243 is OK: PROCS OK: 0 processes with command name dhclient
[14:37:28] <icinga-wm>	 RECOVERY - DPKG on mw2243 is OK: All packages OK
[14:37:47] <icinga-wm>	 RECOVERY - nutcracker port on mw1284 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212
[14:37:57] <icinga-wm>	 RECOVERY - dhclient process on mw1284 is OK: PROCS OK: 0 processes with command name dhclient
[14:40:21] <hashar>	 scap is on sync-masters
[14:41:08] <icinga-wm>	 RECOVERY - DPKG on mw1284 is OK: All packages OK
[14:44:20] <paravoid>	 so what was up with that mobileapps page again?
[14:44:23] <paravoid>	 and why aren't we getting paged?
[14:45:36] <godog>	 I'm looking into the former question with mobileapps logs + logstash, no idea about latter though
[14:46:14] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 031] ferm: Kill INTERNAL_V4/INTERNAL_V6 definitions [puppet] - 10https://gerrit.wikimedia.org/r/295332 (owner: 10Alexandros Kosiaris)
[14:47:25] <moritzm>	 !log rolling restart of aqs service on aqs1001-aqs1006 to pick up new firejail settings
[14:47:29] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[14:48:47] <elukey>	 joal --^
[14:50:01] <grrrit-wm>	 (03CR) 10Faidon Liambotis: [C: 04-1] "I still don't see how this makes sense. What does "sphere private" really mean here? Why would that combination of networks /ever/ be usef" [puppet] - 10https://gerrit.wikimedia.org/r/295333 (owner: 10Alexandros Kosiaris)
[14:50:08] <yurik_>	 thcipriani, around?
[14:51:12] <wikibugs>	 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#2396678 (10Papaul)
[14:52:52] <thcipriani>	 yurik_: yup, what's up?
[14:53:04] <yurik_>	 thcipriani, hey, want to do graph spec3 later today?
[14:53:12] <wikibugs>	 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/deploy new codfw mw app servers - https://phabricator.wikimedia.org/T135466#2299896 (10Papaul) a:05Papaul>03Joe OS installation complete on all the hosts puppet cert and salt-key complete as well.
[14:53:56] <hashar>	 thcipriani: good morning! scap is still going on.  I have lost time trying to figure out a very lame mistake / typo :D
[14:54:10] <hashar>	 sec patches applied for sure,  yurik_ is willing to get a security patch of some sort added
[14:54:29] <hashar>	 and Roan is adding a few changes to Echo but I guess he will reach a working state by the time of deployment
[14:54:55] <thcipriani>	 hashar: ack, thanks for taking care of all that :)
[14:55:50] <thcipriani>	 yurik_: yup if you're up for it, I've got the puppet part (https://gerrit.wikimedia.org/r/#/c/294357/) if you've got the graphoid ./scap dir part :)
[14:56:42] <hashar>	 syncapaches is 44% done
[14:57:14] <yurik_>	 thcipriani, there is only one patch for swat - we could bug gehel to merge any needed puppet stuff
[14:57:18] <yurik_>	 if he's around :)
[14:57:33] * yurik_ goes to look at the scap dir thingy again
[14:58:26] <RoanKattouw>	 I only have one Echo patch, and I've got it lined up, I just need to wait for scap to finish
[15:00:04] <jouncebot>	 anomie, ostriches, thcipriani, marktraceur, and Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160621T1500). Please do the needful.
[15:00:04] <jouncebot>	 kart_: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[15:00:21] <kart_>	 around
[15:00:44] <thcipriani>	 kart_: holding SWAT until scap is complete.
[15:01:26] <wikibugs>	 06Operations, 10ops-codfw, 10hardware-requests: procure syslog hardware in codfw - https://phabricator.wikimedia.org/T138075#2396708 (10RobH) a:03RobH
[15:01:33] <hashar>	 70%
[15:01:38] <hashar>	 well
[15:01:42] <hashar>	 at worth we can cancel the scap
[15:01:56] <hashar>	 the 30% remaining of scap to testwiki can be done again later
[15:02:07] <kart_>	 thcipriani: Sure
[15:02:18] <thcipriani>	 hashar: I'd rather let it complete
[15:03:01] <hashar>	 lets stream the progress on hangout https://hangouts.google.com/hangouts/_/wikimedia.org/scap  :D
[15:03:15] <kart_>	 joining.
[15:03:25] <kart_>	 :)
[15:03:27] <hashar>	 that the modern way of sharing a display bar
[15:03:31] <hashar>	 a progress bar
[15:03:37] <hashar>	 80 left
[15:04:24] <icinga-wm>	 RECOVERY - puppet last run on mw1284 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures
[15:05:10] <wikibugs>	 06Operations, 06Services: mobileapps 500s following reboot of restbase1007 - https://phabricator.wikimedia.org/T138314#2396741 (10fgiunchedi)
[15:07:25] <hashar>	 cdb rebuild
[15:07:33] <hashar>	 thcipriani: kart_ really sorry about the added delay :-
[15:07:47] <hashar>	 I screwed up the first scap by switcing labtestwiki instead of testwiki
[15:07:56] <hashar>	 turns out it causes a fatal eventually :D
[15:08:50] <kart_>	 hashar: no worries at all.
[15:09:22] <thcipriani>	 hashar: hmm, weird, but probably a good thing. I like the hangout!
[15:09:25] <yurik_>	 thcipriani, poke me when you want to play with the graph stuff
[15:09:32] <yurik_>	 i'm getting it ready in the mean time
[15:09:49] <thcipriani>	 yurik_: ack, sounds good, I'll be ready post-SWAT most likely :)
[15:12:16] <jynus>	 I'd like to remind you that my change is merged but not deployed
[15:12:28] <grrrit-wm>	 (03PS1) 10Elukey: Add new MW appservers to the scap DSH list. [puppet] - 10https://gerrit.wikimedia.org/r/295353 
[15:12:30] <gehel>	 yurik_: sorry, afk atm. You need me to review / merge something?
[15:12:44] <icinga-wm>	 RECOVERY - puppet last run on mw2243 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[15:12:56] <yurik_>	 gehel, we want to switch graphoid service to scap3, same as before
[15:15:53] <icinga-wm>	 PROBLEM - Apache HTTP on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:16:53] <icinga-wm>	 PROBLEM - puppet last run on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:16:54] <icinga-wm>	 PROBLEM - HHVM rendering on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:17:03] <hashar>	 thcipriani: I dont know what is wrong with the last host 
[15:17:24] <icinga-wm>	 PROBLEM - Check size of conntrack table on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:17:34] <icinga-wm>	 PROBLEM - SSH on mw1131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:17:49] <grrrit-wm>	 (03CR) 10Elukey: [C: 032] Add new MW appservers to the scap DSH list. [puppet] - 10https://gerrit.wikimedia.org/r/295353 (owner: 10Elukey)
[15:17:53] <icinga-wm>	 PROBLEM - salt-minion processes on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:17:54] <icinga-wm>	 PROBLEM - configured eth on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:18:11] <thcipriani>	 hashar: looks like mw1131 is having a bad time
[15:18:13] <icinga-wm>	 PROBLEM - dhclient process on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:18:33] <icinga-wm>	 PROBLEM - nutcracker port on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:18:34] <icinga-wm>	 PROBLEM - DPKG on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:18:43] <thcipriani>	 hashar: you should login in a new term and kill 26562
[15:18:45] <icinga-wm>	 PROBLEM - nutcracker process on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:18:54] <icinga-wm>	 PROBLEM - HHVM processes on mw1131 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:19:08] <thcipriani>	 (that's the process that's running ssh mw1131 scap cdb-rebuild)
[15:19:24] <hashar>	 should I abort ?
[15:19:40] <thcipriani>	 no, open a new term and just kill 26562
[15:19:56] <greg-g>	 what's the timeout on those?
[15:19:57] <hashar>	 done
[15:20:13] <icinga-wm>	 RECOVERY - configured eth on mw1131 is OK: OK - interfaces up
[15:20:14] <logmsgbot>	 !log hashar@tin Finished scap: testwiki to group0 (previously was labtestwiki which does not work) (duration: 51m 45s)
[15:20:14] <icinga-wm>	 RECOVERY - dhclient process on mw1131 is OK: PROCS OK: 0 processes with command name dhclient
[15:20:18] <hashar>	 15:19:54 sudo -u mwdeploy -n -- /usr/bin/scap cdb-rebuild on mw1131.eqiad.wmnet returned [143]: l10n merge:   0% (ok: 0; fail: 0; left: 393)                                                                                                                      
[15:20:18] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:20:24] <hashar>	 thcipriani: kart_ SWAT open
[15:20:25] <hashar>	 sorry :(
[15:20:29] <thcipriani>	 hashar: cool, thanks :)
[15:20:34] <icinga-wm>	 RECOVERY - nutcracker port on mw1131 is OK: TCP OK - 0.000 second response time on port 11212
[15:20:44] <icinga-wm>	 RECOVERY - DPKG on mw1131 is OK: All packages OK
[15:20:54] <icinga-wm>	 RECOVERY - nutcracker process on mw1131 is OK: PROCS OK: 1 process with UID = 108 (nutcracker), command name nutcracker
[15:21:03] <icinga-wm>	 RECOVERY - HHVM processes on mw1131 is OK: PROCS OK: 6 processes with command name hhvm
[15:21:14] <icinga-wm>	 RECOVERY - puppet last run on mw1131 is OK: OK: Puppet is currently enabled, last run 35 minutes ago with 0 failures
[15:21:45] <icinga-wm>	 RECOVERY - Check size of conntrack table on mw1131 is OK: OK: nf_conntrack is 0 % full
[15:21:54] <icinga-wm>	 RECOVERY - SSH on mw1131 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7 (protocol 2.0)
[15:22:15] <icinga-wm>	 RECOVERY - salt-minion processes on mw1131 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[15:23:42] <elukey>	 this one is not a new appserver --^
[15:23:58] <elukey>	 ah ok just read the backlog
[15:24:45] <kart_>	 thcipriani: we can start with test hosts, but you can review the patch and determine if that's needed. we don't have dblist today.
[15:25:36] <thcipriani>	 kart_: ack. give me 1 second to do some quick cleanup.
[15:25:45] <hashar>	 looks like scap killed mw1131 somehow
[15:25:49] <hashar>	 anyway testwiki is switched
[15:25:50] <hashar>	 all set 
[15:25:59] <hashar>	 thcipriani: I am rushing out ! have safe deploy!
[15:26:13] <hashar>	 dapatrick might have some more patches to apply
[15:26:29] <thcipriani>	 hashar: thank you!
[15:27:51] <dapatrick>	 None for me this week, unless we have some emergency bug reports.
[15:28:05] <dapatrick>	 hashar ^^
[15:29:03] <grrrit-wm>	 (03PS1) 10Faidon Liambotis: openldap: enable the memberof overlay [puppet] - 10https://gerrit.wikimedia.org/r/295357 
[15:29:06] <paravoid>	 moritzm: hey
[15:29:59] <gehel>	 yurik_: I'll be available in 5-10'...
[15:30:12] <yurik_>	 thx gehel !
[15:30:17] <yurik_>	 i'm getting the patches ready
[15:30:37] <thcipriani>	 kart_: ok, let's get started :)
[15:31:54] <grrrit-wm>	 (03PS2) 10Thcipriani: Deploy Compact Language Links as default (Stage 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295334 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry)
[15:32:25] <grrrit-wm>	 (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295334 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry)
[15:32:59] <gehel>	 yurik_: I'm here, sorry for the delay
[15:33:01] <grrrit-wm>	 (03Merged) 10jenkins-bot: Deploy Compact Language Links as default (Stage 1) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295334 (https://phabricator.wikimedia.org/T136677) (owner: 10KartikMistry)
[15:33:26] <kart_>	 thcipriani: sure
[15:33:43] <yurik_>	 gehel, no worries, still taking a few minutes to set up the scap3 patches for graphoid. Also, i think thcipriani has created some puppet patch for graphoid 
[15:34:25] <thcipriani>	 jynus: want me to sync Repool db1068 with low weight; depool db1061 and db1062 ?
[15:35:44] <icinga-wm>	 PROBLEM - puppet last run on kafka2002 is CRITICAL: CRITICAL: puppet fail
[15:35:45] <icinga-wm>	 RECOVERY - Unmerged changes on repository mediawiki_config on tin is OK: No changes to merge.
[15:38:30] <thcipriani>	 kart_: change is on mw1017 now
[15:39:05] <kart_>	 thcipriani: checking. 
[15:39:33] <kart_>	 thcipriani: I need to enable x-mw-debug and test on specific WP we deployed, right?
[15:39:42] <kart_>	 Looks good with en.wikivoyage.
[15:39:48] <thcipriani>	 kart_: yup.
[15:39:53] <thcipriani>	 spot-check with mwrepl looks ok
[15:40:14] <icinga-wm>	 RECOVERY - puppet last run on kafka2002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:40:32] <thcipriani>	 kart_: lemme know when you're ready for me to sync everywhere
[15:40:40] <jynus>	 thcipriani, either you do it or I can, when there is a hole
[15:40:54] <icinga-wm>	 PROBLEM - puppet last run on labsdb1003 is CRITICAL: CRITICAL: puppet fail
[15:41:14] <kart_>	 thcipriani: few seconds more, checking negative test.
[15:41:19] <yurik_>	 thcipriani, https://gerrit.wikimedia.org/r/295358 
[15:41:53] <thcipriani>	 jynus: syncing now
[15:42:16] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/db-eqiad.php: Repool db1068 with low weight; depool db1061 and db1062 (duration: 00m 30s)
[15:42:21] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:42:41] <jynus>	 thank you, I would say "checking", but I do that all the time
[15:42:54] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1284 is OK: OK
[15:43:12] <kart_>	 thcipriani: go ahead.
[15:43:24] <icinga-wm>	 RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[15:44:35] <thcipriani>	 kart_: doing.
[15:44:50] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:295334|Deploy Compact Language Links as default (Stage 1)]] (duration: 00m 25s)
[15:44:54] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:44:54] <thcipriani>	 ^ kart_ check please
[15:46:52] <kart_>	 thcipriani: thanks. Checking.
[15:47:46] <yurik_>	 thcipriani, i do want to deploy it to beta cluster, but i'm not sure the servers there are properly setup yet
[15:48:17] <yurik_>	 basically deployment-graphoid.deployment-prep.eqiad.wmflabs does not exist yet :)
[15:48:37] <thcipriani>	 yurik_: heh, yeah, that sounds like a blocker :P
[15:49:28] <yurik_>	 thcipriani, ok, merging. gehel, go ahead and enable it, if thcipriani allows?
[15:49:38] <yurik_>	 s/allows/ok with it :)
[15:49:56] <gehel>	 yurik_: remind me of the context...
[15:50:07] <yurik_>	 gehel, scap3 for graphoid service
[15:50:32] <yurik_>	 gehel, https://gerrit.wikimedia.org/r/#/c/294357
[15:50:41] <thcipriani>	 gehel: so puppet would need...yeah ^ that
[15:50:54] <gehel>	 ok, looking...
[15:51:28] <kart_>	 thcipriani: sorry, looks good.
[15:51:36] <thcipriani>	 kart_: np, thanks for checking :)
[15:51:53] <kart_>	 thcipriani: thanks a lot. x-mw-debug thing must be used for all.
[15:52:36] <thcipriani>	 yurik_: before puppet runs on the graphoid hosts you have to do: scap deploy --init (after your /deploy patch merges and is on tin)
[15:53:19] <logmsgbot>	 !log catrope@tin Synchronized php-1.28.0-wmf.7/extensions/Echo/: (no message) (duration: 00m 33s)
[15:53:22] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:53:30] <gehel>	 yurik_, thcipriani: lgtm, merging
[15:53:52] <grrrit-wm>	 (03PS2) 10Gehel: Deploy Graphoid with Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/294357 (owner: 10Thcipriani)
[15:53:57] <yurik_>	 thcipriani, doing it now
[15:55:03] <yurik_>	 thcipriani, done
[15:55:31] <grrrit-wm>	 (03CR) 10Gehel: [C: 032] Deploy Graphoid with Scap3 [puppet] - 10https://gerrit.wikimedia.org/r/294357 (owner: 10Thcipriani)
[15:55:38] <thcipriani>	 cool, should be good to run puppet on graphoid targets when ^ merges
[15:57:12] <gehel>	 thcipriani, yurik_: I seem to remember needing a puppet run on tin as well (where are my notes when I need them?)
[15:57:37] <thcipriani>	 gehel: yup, a puppet run on tin first for housekeeping
[15:58:18] <gehel>	 !log puppet run on tin to enable scap3 deployment for graphoid
[15:58:22] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[15:58:31] <elukey>	 thcipriani: still working on mw1131
[15:58:33] <elukey>	 ?
[15:59:04] <thcipriani>	 elukey: no, unless RoanKattouw has any more Echo stuff, SWAT should be complete
[15:59:17] <RoanKattouw>	 I synced my one Echo thing
[15:59:19] <RoanKattouw>	 so I'm done
[16:00:04] <jouncebot>	 godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160621T1600). Please do the needful.
[16:00:44] <elukey>	 thcipriani: ah ok because I can still see a critical for hhvm
[16:00:48] <elukey>	 will restart it
[16:01:10] <gehel>	 thcipriani, yurik_: puppet run on tin complete
[16:02:04] <thcipriani>	 gehel: ack, thanks. can you run on graphoid nodes too please?
[16:03:11] <gehel>	 thcipriani: scb[12]00[12] ?
[16:04:02] <thcipriani>	 gehel: yup that looks correct based on yurik_ 's patch
[16:04:15] <gehel>	 thcipriani: running right now...
[16:05:05] <gehel>	 done
[16:05:35] <thcipriani>	 gehel: thank you!
[16:05:46] <gehel>	 thcipriani: at your service...
[16:06:09] <thcipriani>	 yurik_: could you spot-check scb2001 to make sure that /srv/deployment/graphoid is owned by deploy-service?
[16:06:17] <icinga-wm>	 RECOVERY - puppet last run on labsdb1003 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[16:06:32] * yurik_ looks
[16:07:13] <gehel>	 thcipriani: checked on scb1001, looks good
[16:07:28] <thcipriani>	 gehel: cool, thanks :)
[16:07:36] <yurik_>	 thcipriani, yep
[16:07:40] <yurik_>	 hehe
[16:07:54] <gehel>	 actually, looks good on all 4 servers
[16:07:54] <thcipriani>	 yurik_: feel free to pull the trigger: scap deploy -v
[16:08:02] * thcipriani watches
[16:08:12] * gehel crosses fingers
[16:09:20] <thcipriani>	 should deploy to scb2001.codfw.wmnet, restart service there, check 19000 is accepting connections, prompt to continue, then do the others
[16:10:17] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw2243 is OK: OK
[16:12:01] * yurik_ tries graphoid scap3
[16:16:15] <yurik_>	 finishing syncing
[16:16:27] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1285 is OK: OK
[16:16:27] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1286 is OK: OK
[16:16:40] <thcipriani>	 yurik_: nice :)
[16:17:21] <thcipriani>	 gehel: yurik_ thanks again for all the work to port to scap3, much obliged.
[16:17:45] <yurik_>	 thanks thcipriani !
[16:17:47] * gehel did not do much... but gaind a bit of knowledge in the process...
[16:18:27] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1274 is OK: OK
[16:19:38] <icinga-wm>	 RECOVERY - mediawiki-installation DSH group on mw1283 is OK: OK
[16:24:53] <grrrit-wm>	 (03PS5) 10Gehel: Configuration for new elasticsearch servers in eqiad. [puppet] - 10https://gerrit.wikimedia.org/r/294918 
[16:31:10] <grrrit-wm>	 (03CR) 10Gehel: [C: 032] Configuration for new elasticsearch servers in eqiad. [puppet] - 10https://gerrit.wikimedia.org/r/294918 (owner: 10Gehel)
[16:32:05] <gehel>	 !log starting installation of new elasticsearch server elastic1032.eqiad.wmnet
[16:32:09] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:33:05] <yurik_>	 !log deployed and restarted graphoid with scap3
[16:33:09] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[16:35:18] <icinga-wm>	 RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[16:38:13] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, good job!" [puppet] - 10https://gerrit.wikimedia.org/r/291819 (owner: 10Alexandros Kosiaris)
[16:52:25] <grrrit-wm>	 (03CR) 10Filippo Giunchedi: [C: 031] "two small nits in comments, LGTM otherwise, also verified via PCC for https://gerrit.wikimedia.org/r/#/c/291819/ https://puppet-compiler.w" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis)
[16:53:52] <grrrit-wm>	 (03PS1) 10Urbanecm: Temporary IP Cap Lift on es.wiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295365 (https://phabricator.wikimedia.org/T138322) 
[16:56:27] <Urbanecm>	 Hi, can somebody deploy https://gerrit.wikimedia.org/r/#/c/295365/ for eswiki? This is throttle rule for event that's held today. See T138322 for details. 
[16:56:27] <stashbot>	 T138322: Temporary IP Cap Lift on es.wiki and commons - https://phabricator.wikimedia.org/T138322
[16:58:33] <Urbanecm>	 Ping: thcipriani Krenair anomie 
[16:59:54] * thcipriani looks
[17:00:04] <jouncebot>	 yurik, gwicke, cscott, arlolra, and subbu: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160621T1700).
[17:00:05] <wikibugs>	 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, 03Discovery-Search-Sprint: Install and configure new elasticsearch servers in eqiad - https://phabricator.wikimedia.org/T138329#2397068 (10Gehel)
[17:00:38] <subbu>	 no deploy today.
[17:00:51] <wikibugs>	 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, 03Discovery-Search-Sprint: Install and configure new elasticsearch servers in eqiad - https://phabricator.wikimedia.org/T138329#2397068 (10Gehel) Configuration of new servers was done in https://gerrit.wikimedia.org/r/#/c/294918/ (so...
[17:01:40] <grrrit-wm>	 (03PS32) 10Filippo Giunchedi: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis)
[17:01:43] <grrrit-wm>	 (03PS1) 10Filippo Giunchedi: syslog: limit source range to $PRODUCTION_NETWORKS [puppet] - 10https://gerrit.wikimedia.org/r/295368 
[17:02:07] <Urbanecm>	 subbu: Why? This cannot be deployed in no usual deploy window because I have no time from eswiki to schedule it. The event (which needs one of throttle rules I've added) is held today 13:00 UTC-5. This cannot be done in any SWAT (and mind that I'm in Europe). 
[17:02:13] <wikibugs>	 06Operations, 06Discovery, 06Discovery-Search-Backlog, 10Elasticsearch, 03Discovery-Search-Sprint: Install and configure new elasticsearch servers in eqiad - https://phabricator.wikimedia.org/T138329#2397100 (10Gehel) elastic1032 is installed and configured. It joined the cluster without issues and is st...
[17:02:38] <subbu>	 Urbanecm, sorry .. I meant: we aren't deploying parsoid today.
[17:02:56] <grrrit-wm>	 (03CR) 10Thcipriani: [C: 032] Temporary IP Cap Lift on es.wiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295365 (https://phabricator.wikimedia.org/T138322) (owner: 10Urbanecm)
[17:03:43] <grrrit-wm>	 (03Merged) 10jenkins-bot: Temporary IP Cap Lift on es.wiki and commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295365 (https://phabricator.wikimedia.org/T138322) (owner: 10Urbanecm)
[17:03:44] <Urbanecm>	 subbu: Ok :)
[17:04:02] <Urbanecm>	 Thanks for explanation
[17:04:37] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: network: add $production_networks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis)
[17:05:01] <yurik_>	 deploying graphoid and tilerator
[17:06:47] <logmsgbot>	 !log thcipriani@tin Synchronized wmf-config/throttle.php: [[gerrit:295365|Temporary IP Cap Lift on es.wiki and commons]] (duration: 00m 24s)
[17:06:55] <thcipriani>	 ^ Urbanecm 
[17:07:12] <Urbanecm>	 Thanks for your deploy thcipriani :)
[17:07:26] <thcipriani>	 Urbanecm: thanks for keeping up with these on short notice :)
[17:07:47] <Urbanecm>	 You're welcome :). 
[17:08:05] <paravoid>	 akosiaris, godog: so, I think we should just redefine INTERNAL for these purposes
[17:08:10] <paravoid>	 and not introduce production_networks
[17:08:47] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS on elastic1032 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:IO::Socket::SSL: connect: Connection refused
[17:09:05] <grrrit-wm>	 (03CR) 10Alexandros Kosiaris: [C: 031] Postgresql: init database with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295343 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel)
[17:10:27] <grrrit-wm>	 (03PS33) 10Alexandros Kosiaris: network: add $production_networks [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis)
[17:10:30] <godog>	 paravoid: yeah that's probably better, no realm in the variable
[17:10:48] <yurik_>	 !log deployed graphoid https://gerrit.wikimedia.org/r/#/c/295367/
[17:10:52] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:11:22] <paravoid>	 and set $internal_networks to slice_network_constants($::realm)
[17:12:33] <akosiaris>	 paravoid: I 'd like to kill INTERNAL tbh
[17:12:51] <akosiaris>	 it's badly named and easily misunderstood/misinterpreted
[17:13:17] <godog>	 REALM_NETWORKS ?
[17:13:21] <paravoid>	 well, ok, that's true
[17:13:27] <paravoid>	 but production_networks won't work with labs
[17:13:53] <akosiaris>	 labs ? as in labs VMs ?
[17:13:58] <paravoid>	 so having a realm-dependent variable which means "protected to the internal of our network, not accessible by the internet" would be useful
[17:14:11] <paravoid>	 in Labs instances that happen to use ops/puppet code to set up something
[17:16:05] <godog>	 yeah I agree internal is a bit misleading, realm_networks might do
[17:16:43] <akosiaris>	 so, we have various needs. For example we do want a PRODUCTION_NETWORKS and a LABS_NETWORKS structure in production
[17:16:51] <logmsgbot>	 !log thcipriani@tin Synchronized php-1.28.0-wmf.7/extensions/Graph/lib/graph2.compiled.js: pre-train backport: [[gerrit:295366|Updated to latest graph2 lib]] (duration: 00m 31s)
[17:16:54] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[17:16:56] <akosiaris>	 the LABS_NETWORKS used mostly in labstores and labsdbs
[17:17:20] <akosiaris>	 and anything else we might want to share infra
[17:18:07] <akosiaris>	 so in that case REALM_NETWORKS == PRODUCTION_NETWORKS but we do need something extra as well. At which level exactly I am not sure 
[17:18:43] <akosiaris>	 but it probably does makes sense to do it someplace somewhat central
[17:18:56] <godog>	 yeah, for cross-realm ferm rules in production we could do REALM_NETWORKS + LABS_NETWORKS
[17:19:24] <akosiaris>	 do we have the reverse need in labs ? 
[17:19:25] <godog>	 I mean the labstore/labdb case is sort of production by definition as I see it
[17:19:40] <akosiaris>	 I don't think so, but please do correct me if I am wrong
[17:19:49] <godog>	 I hope we are not going to access/rely labs instances from production but yeah there could be exceptions
[17:20:02] <godog>	 not from the internal network anyway
[17:20:29] <akosiaris>	 ah, yes there are. CI
[17:20:45] <grrrit-wm>	 (03PS1) 10Gehel: Adding missing dependency in exposing puppet SSL certs on elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/295369 (https://phabricator.wikimedia.org/T138329) 
[17:21:01] <akosiaris>	 well... CI from private ? or public. right now it's public, but contint1001 will not be
[17:22:02] <godog>	 mhhh CI running in production reaching out to labs instances via the non-public labs addresses?
[17:22:57] <akosiaris>	 I think so at least. hashar is reworking on the CI architecture these days
[17:23:21] <thcipriani>	 we actually have some open questions about the new CI architecture: https://phabricator.wikimedia.org/T133300
[17:23:22] <akosiaris>	 good timing at least
[17:24:31] <godog>	 indeedly!
[17:25:08] <thcipriani>	 not clear if we're going to be using contint1001 since it can't reach labs instances from its network(?)
[17:26:30] <grrrit-wm>	 (03CR) 10Andrew Bogott: "Can we please just bump this to 10,000 and then forget about it for another year or two? The migration from opendj to openldap applied a " [puppet] - 10https://gerrit.wikimedia.org/r/295198 (https://phabricator.wikimedia.org/T122595) (owner: 10Muehlenhoff)
[17:29:26] <godog>	 afaik it shouldn't via the internal labs addresses no
[17:30:27] <godog>	 don't quote me on that though :)
[17:30:31] <wikibugs>	 06Operations, 10Traffic, 10Wikimedia-Logstash: Move logstash to an LVS service - https://phabricator.wikimedia.org/T132458#2397252 (10bd808) This might be something to look at doing as part of {T138328}.
[17:31:13] <akosiaris>	 I agree, labs private IPs should be not be reachable from production private IPs and vice versa
[17:33:23] <chasemp>	 there are servers within prod private ip space accessible from private ip space in labs, primarily labs-support vlan things and ldap servers
[17:33:28] <thcipriani>	 yup. so we're thinking of moving to scandium which is in the labs host network, but since we do rely on the varnish misc cache: we're not sure if we can move into that network.
[17:34:19] <thcipriani>	 I think that's our main open question: whether we can still be behind the varnish cache if we move to scandium.
[17:35:53] <wikibugs>	 06Operations, 10ORES, 06Revision-Scoring-As-A-Service: ORES should advertise swagger specs under /?spec - https://phabricator.wikimedia.org/T137804#2397269 (10Halfak) https://github.com/wiki-ai/ores/pull/151
[17:36:56] <wikibugs>	 06Operations, 10Traffic, 10Wikimedia-Logstash: Move logstash to an LVS service - https://phabricator.wikimedia.org/T132458#2397273 (10bd808) Related: {T113104}
[17:38:22] <chasemp>	 thcipriani: you want scandium on a private vlan, acccessible by labs VM's, able to access labs VM's (22 only?), and able to be behind varnish for gerrit/jenkins?
[17:40:28] <thcipriani>	 chasemp: that is mostly my understanding. Most of my understanding comes from hasharAway so small bits of that may not be true, but I think the broad strokes are correct.
[17:41:20] <chasemp>	 thcipriani: why are we putting scandium is labs-hosts1-b-eqiad?
[17:41:41] <chasemp>	 in even
[17:42:04] <thcipriani>	 I don't understand the question
[17:42:09] <akosiaris>	 it's in labs-support1 btw
[17:42:39] <chasemp>	 labs-hosts is mostly for openstack inf, nodepool was included there for that reason and it has the labvirts and is also the transit network for actual labs vm's
[17:42:56] <chasemp>	 labs-support is generally services we consider production that provide functionality to labs vm's
[17:43:01] <chasemp>	 i.e. nfs, etc
[17:43:40] <chasemp>	 so labs-support seems ideologically the right place and there isn't a reason it couldn't be behind varnish, other than it may not be setup afaik
[17:43:51] <chasemp>	 I don't get why promethium is in labs-hosts
[17:44:34] <chasemp>	 but it's not a good misc services or misc things vlan 
[17:44:44] <godog>	 I have to go, will read the tl;dr on tasks :)
[17:50:34] <chasemp>	 it seems like crossed wires
[17:50:43] <chasemp>	 https://phabricator.wikimedia.org/T133300#2380886 indicates labs-support
[17:50:58] <chasemp>	 https://phabricator.wikimedia.org/T133300#2382725 hashar calls out labs host network
[17:51:12] <chasemp>	 these are functionallity separate things in that there is an actual labs-support and labs-hosts
[17:51:18] <chasemp>	 so I think there is confusion there
[17:51:32] <chasemp>	 anyhoo
[17:52:07] <thcipriani>	 yeah, in our discussions there has definitely been some confusion about the network layout. I'm largely unfamiliar with these groupings.
[17:53:26] <thcipriani>	 chasemp: would you be able to write a bit on that task to clear up the distinction between labs-host and labs-support + the varnish info you mentioned earlier?
[17:54:06] <chasemp>	 yes but I have some other various questions I think and so I need to reread from teh beginning
[17:54:12] <grrrit-wm>	 (03CR) 10Muehlenhoff: "As for the former comments wrt dropping INTERNAL, let's do that in a followup patch. Once this patch is merged I can review/change existin" [puppet] - 10https://gerrit.wikimedia.org/r/260926 (https://phabricator.wikimedia.org/T122396) (owner: 10Faidon Liambotis)
[17:54:16] <chasemp>	 this is like day 2 after 2 weeks away so I'm out of hte loop
[17:54:43] <thcipriani>	 you missed so much CI-related fun :)
[17:56:46] <akosiaris>	 paravoid: godog: https://etherpad.wikimedia.org/p/realm_networks
[17:57:09] <akosiaris>	 I 've put an effort to approach the problem there plus some proposed solutions. Please comment
[17:57:16] <akosiaris>	 and now I am off for the day 
[17:59:29] <grrrit-wm>	 (03PS2) 10Gehel: Postgresql: init database with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295343 (https://phabricator.wikimedia.org/T138092) 
[18:01:08] <grrrit-wm>	 (03CR) 10Gehel: [C: 032] Postgresql: init database with Puppet [puppet] - 10https://gerrit.wikimedia.org/r/295343 (https://phabricator.wikimedia.org/T138092) (owner: 10Gehel)
[18:02:23] <grrrit-wm>	 (03PS1) 10BBlack: varnish: burn more cpu/mem on better gzip compression [puppet] - 10https://gerrit.wikimedia.org/r/295372 
[18:04:23] <grrrit-wm>	 (03PS1) 10Gehel: Revert "Postgresql: init database with Puppet" [puppet] - 10https://gerrit.wikimedia.org/r/295373 
[18:04:31] <grrrit-wm>	 (03CR) 10BBlack: [C: 032] varnish: burn more cpu/mem on better gzip compression [puppet] - 10https://gerrit.wikimedia.org/r/295372 (owner: 10BBlack)
[18:05:00] <grrrit-wm>	 (03CR) 10Gehel: [C: 032] "Dependency cycle issue not detected by puppet compiler, reverting" [puppet] - 10https://gerrit.wikimedia.org/r/295373 (owner: 10Gehel)
[18:05:12] <grrrit-wm>	 (03CR) 10Gehel: [V: 032] "Dependency cycle issue not detected by puppet compiler, reverting" [puppet] - 10https://gerrit.wikimedia.org/r/295373 (owner: 10Gehel)
[18:06:07] <grrrit-wm>	 (03PS2) 10Gehel: Revert "Postgresql: init database with Puppet" [puppet] - 10https://gerrit.wikimedia.org/r/295373 
[18:06:23] <grrrit-wm>	 (03CR) 10Gehel: [V: 032] Revert "Postgresql: init database with Puppet" [puppet] - 10https://gerrit.wikimedia.org/r/295373 (owner: 10Gehel)
[18:23:47] <grrrit-wm>	 (03PS1) 10Ori.livneh: Optimize mobile static images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295374 
[18:35:29] <wikibugs>	 07Blocked-on-Operations, 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: adywiki is missing the associated adywiki_p database with appropriate views - https://phabricator.wikimedia.org/T135029#2286195 (10Gehel)
[18:36:07] <grrrit-wm>	 (03CR) 10BBlack: [C: 031] "+1 for zopfli awesomeness :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295374 (owner: 10Ori.livneh)
[18:37:25] <icinga-wm>	 PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 678 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5206350 keys - replication_delay is 678
[18:37:28] <grrrit-wm>	 (03PS1) 10BBlack: caches: tcp_notsent_lowat => 128K [puppet] - 10https://gerrit.wikimedia.org/r/295376 
[18:39:34] <icinga-wm>	 RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5150037 keys - replication_delay is 0
[18:39:59] <grrrit-wm>	 (03CR) 10BBlack: [C: 032] caches: tcp_notsent_lowat => 128K [puppet] - 10https://gerrit.wikimedia.org/r/295376 (owner: 10BBlack)
[18:41:44] <icinga-wm>	 PROBLEM - puppet last run on cp3044 is CRITICAL: CRITICAL: puppet fail
[18:48:14] <icinga-wm>	 RECOVERY - puppet last run on cp3044 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[18:50:55] <bblack>	 !log enabled tcp_notsent_lowat optimization on all caches (marking this time for investigation of perf graphs later) - https://gerrit.wikimedia.org/r/#/c/295376/
[18:50:58] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[18:56:13] <wikibugs>	 07Blocked-on-Operations, 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: adywiki and jamwiki are missing the associated *_p databases with appropriate views - https://phabricator.wikimedia.org/T135029#2397393 (10MaxSem)
[19:00:04] <jouncebot>	 hashar: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160621T1900).
[19:02:32] <thcipriani>	 train time!
[19:02:54] <ori>	         ____                           
[19:02:54] <ori>	    _||__|  |  ______   ______   ______ 
[19:02:54] <ori>	   (        | |      | |      | |      |
[19:02:56] <ori>	   /-()---() ~ ()--() ~ ()--() ~ ()--()
[19:06:16] <wikibugs>	 06Operations, 10ops-codfw, 10hardware-requests: procure syslog hardware in codfw - https://phabricator.wikimedia.org/T138075#2397399 (10RobH) So all of the spare hardware currently in codfw far exceeds that of lithium.eqiad.wmnet.  lithium: lithium is a Central syslog server (role::syslog::centralserver) Sin...
[19:06:23] <MatmaRex>	 :o
[19:08:07] <grrrit-wm>	 (03PS1) 10Thcipriani: Group0 to 1.28.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295380 
[19:11:50] <grrrit-wm>	 (03CR) 10Thcipriani: [C: 032] Group0 to 1.28.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295380 (owner: 10Thcipriani)
[19:12:27] <grrrit-wm>	 (03Merged) 10jenkins-bot: Group0 to 1.28.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295380 (owner: 10Thcipriani)
[19:14:12] <logmsgbot>	 !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.28.0-wmf.7
[19:14:16] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:14:25] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on logstash1001 is CRITICAL: CRITICAL - elasticsearch http://10.64.0.122:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.64.0.122, port=9200): Read timed out. (read timeout=4)
[19:17:35] <dcausse>	 logstash1001 OOMed :/
[19:17:43] <bd808>	 !log Restarted ElasticSearch on logstash1001; dead from OOM
[19:17:47] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:18:13] <bd808>	 I think I caused it dcausse. I was pointing a Kibana4 instance at it for testing
[19:18:19] <logmsgbot>	 !log thcipriani@tin Purged l10n cache for 1.28.0-wmf.5
[19:18:34] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on logstash1001 is OK: OK - elasticsearch status production-logstash-eqiad: status: green, number_of_nodes: 6, unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, timed_out: False, active_primary_shards: 49, cluster_name: production-logstash-eqiad, relocating_shards: 0, active_shards: 147, initializing_shards: 0, number_of_data_nodes: 3, delayed_unassigned_sh
[19:18:38] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:18:50] <dcausse>	 bd808: you mean queries from kibana4 killed elastic?
[19:19:01] <bd808>	 it looks like it, yes
[19:19:06] <dcausse>	 doh... :/
[19:20:11] <bd808>	 I haven't played with kibana4 for about a year but it seems to be just as gross as I remembered
[19:22:54] <icinga-wm>	 PROBLEM - logstash process on logstash1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 998 (logstash), command name java, args logstash
[19:23:12] <bd808>	 yuck. what's busted now?
[19:25:03] <icinga-wm>	 RECOVERY - logstash process on logstash1001 is OK: PROCS OK: 1 process with UID = 998 (logstash), command name java, args logstash
[19:25:33] <dcausse>	 bd808: logstash stops by itself if it fails too many times on elastic?
[19:26:16] <bd808>	 dcausse: apparently. And it looks like our systemd script for it doesn't start it back up
[19:26:23] <bd808>	 I thought we had fixed that
[19:27:06] <bd808>	 !log Restarted dead logstash process on logstash1001. Looks to have stopped itself due to the the Elasticsearch OOM earlier
[19:27:11] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[19:31:33] <dcausse>	 ps -edf | grep java
[19:32:04] <icinga-wm>	 PROBLEM - Unmerged changes on repository mediawiki_config on mira is CRITICAL: There is one unmerged change in mediawiki_config (dir /srv/mediawiki-staging/, ref HEAD..readonly/master).
[19:32:17] <grrrit-wm>	 (03CR) 10Tim Landscheidt: "My most common use case is to look up the shell user name for a given wiki user name ("Tim Landscheidt"/"Tim_Landscheidt" => "scfc"). I c" [puppet] - 10https://gerrit.wikimedia.org/r/295198 (https://phabricator.wikimedia.org/T122595) (owner: 10Muehlenhoff)
[19:33:06] <moritzm>	 logstash is in status "active, running"?
[19:33:23] <moritzm>	 it takes a bit to complete it's startuo, though
[19:34:28] <bd808>	 moritzm: when the icinga alert went off it was "Active: inactive (dead) since Tue 2016-06-21 19:17:30 UTC; 6min ago"
[19:34:45] <bd808>	 so I did service stop && service start on it
[19:35:20] <moritzm>	 ok, so apparentty it tried to restart itself, but that failed with an I/O exception
[19:35:26] <grrrit-wm>	 (03PS1) 10BBlack: stream.wm.o: drop all DNS TTLs to 5m [dns] - 10https://gerrit.wikimedia.org/r/295384 (https://phabricator.wikimedia.org/T134871) 
[19:35:28] <grrrit-wm>	 (03PS1) 10BBlack: stream.wm.o: move to cache_misc in DNS [dns] - 10https://gerrit.wikimedia.org/r/295385 (https://phabricator.wikimedia.org/T134871) 
[19:35:57] <moritzm>	 at least it logs multiple org.apache.http.impl.execchain.RetryExec execute log lines in journalctl
[19:36:36] <moritzm>	 so it seems systemd correctly tried to respawn, but that failed
[19:36:41] <grrrit-wm>	 (03CR) 10BBlack: [C: 032] stream.wm.o: drop all DNS TTLs to 5m [dns] - 10https://gerrit.wikimedia.org/r/295384 (https://phabricator.wikimedia.org/T134871) (owner: 10BBlack)
[19:39:19] <moritzm>	 I'd say let's make a task of it, it does sound like a bug
[19:40:05] <bd808>	 I'll write one up
[19:47:43] <icinga-wm>	 PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0]
[19:48:30] <grrrit-wm>	 (03CR) 10BBlack: "Just being pedantic, but something seems off with the %diff calculations. How can a file's size be reduced by more than 100%?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295374 (owner: 10Ori.livneh)
[19:52:06] <bd808>	 moritzm: filed as T138345
[19:52:07] <stashbot>	 T138345: Systemd unit did not restart logstash process that died for Elasticsearch connection failures - https://phabricator.wikimedia.org/T138345
[19:57:04] <grrrit-wm>	 (03CR) 10Platonides: "That's a very good point, BBlack." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295374 (owner: 10Ori.livneh)
[19:57:20] * Platonides is pedantic, too
[20:00:20] <grrrit-wm>	 (03CR) 10Ori.livneh: "I made two mistakes: I calculated percent difference instead of percent change, and I expressed percent difference as a negative number, w" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295374 (owner: 10Ori.livneh)
[20:01:11] <ori>	 just don't tell anyone
[20:06:03] <icinga-wm>	 PROBLEM - puppet last run on elastic2001 is CRITICAL: CRITICAL: puppet fail
[20:12:28] <grrrit-wm>	 (03PS2) 10Ori.livneh: Optimize mobile static images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295374 
[20:13:19] <ori>	 Platonides: better?
[20:13:52] <Platonides>	 heh
[20:14:25] <Platonides>	 what was percent difference?
[20:17:49] <ori>	 percent difference is a way of expressing the difference between two values when there is no direction of change (before/after), just two values that mean the same thing
[20:17:58] <ori>	 it's 100 * |a - b| / ((a + b) * 2)
[20:18:36] <ori>	 it's not useful in this case
[20:18:55] <Platonides>	 not at all
[20:19:11] <Platonides>	 (and still, the numbers don't match :P)
[20:19:13] <ori>	 nope, whoever thought so is a careless idiot
[20:19:32] * Platonides gives ori a ^
[20:22:07] <ori>	 / 2, not * 2
[20:22:53] <ori>	 to sum: i used the wrong metric, presented it in the wrong way, and then defined it incorrectly
[20:22:55] <Platonides>	 ah
[20:22:57] <Platonides>	 lol
[20:23:11] <Platonides>	 sorry ori, you can't be perfect everyday ;)
[20:23:29] <ori>	 only 104% of the time
[20:23:44] <ori>	 thanks for pointing it out :) /me lunches
[20:24:07] <Platonides>	 it was bblack who spotted it
[20:24:13] <Platonides>	 I then got intrigued about it
[20:26:07] <ori>	 https://vimeo.com/4435893
[20:30:54] <icinga-wm>	 PROBLEM - puppet last run on wtp2015 is CRITICAL: CRITICAL: puppet fail
[20:32:03] <icinga-wm>	 RECOVERY - puppet last run on elastic2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[20:56:52] <icinga-wm>	 RECOVERY - puppet last run on wtp2015 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures
[21:08:07] <icinga-wm>	 PROBLEM - CirrusSearch codfw 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0]
[21:16:50] <wikibugs>	 06Operations, 06Release-Engineering-Team, 07Developer-notice, 05Gitblit-Deprecate, and 2 others: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2397757 (10greg) email sent. The countdown begins :)
[21:18:25] <wikibugs>	 07Blocked-on-Operations, 10Continuous-Integration-Infrastructure, 10Packaging, 05Gerrit-Migration, and 2 others: Package xhpast (libphutil) - https://phabricator.wikimedia.org/T137770#2397758 (10mmodell) 05Open>03Resolved
[21:25:52] <wikibugs>	 06Operations, 06Release-Engineering-Team, 07Developer-notice, 05Gitblit-Deprecate, and 2 others: Redirect Gitblit urls (git.wikimedia.org) -> Diffusion urls (phabricator.wikimedia.org/diffusion) - https://phabricator.wikimedia.org/T137224#2397765 (10Paladox) @greg thanks :)
[21:31:38] <wikibugs>	 07Blocked-on-Operations, 05Continuous-Integration-Scaling, 13Patch-For-Review, 07WorkType-NewFunctionality: Attempt to provide a Trusty image for Nodepool - https://phabricator.wikimedia.org/T133203#2397768 (10greg)
[21:35:30] <wikibugs>	 06Operations, 10Continuous-Integration-Infrastructure, 10Nodepool, 10Phabricator, and 3 others: Make sure phab can talk to gearman and nodepool instances can talk to phabricator - https://phabricator.wikimedia.org/T131375#2397825 (10greg)
[21:37:02] <icinga-wm>	 RECOVERY - CirrusSearch codfw 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0]
[21:52:18] <grrrit-wm>	 (03PS1) 10Smalyshev: Prepare scap3 deployment for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/295437 (https://phabricator.wikimedia.org/T129144) 
[21:53:35] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Prepare scap3 deployment for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/295437 (https://phabricator.wikimedia.org/T129144) (owner: 10Smalyshev)
[21:56:00] <grrrit-wm>	 (03PS3) 10Ori.livneh: Optimize mobile static images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295374 
[21:56:17] <grrrit-wm>	 (03CR) 10Ori.livneh: [C: 032] Optimize mobile static images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295374 (owner: 10Ori.livneh)
[21:56:53] <grrrit-wm>	 (03Merged) 10jenkins-bot: Optimize mobile static images [mediawiki-config] - 10https://gerrit.wikimedia.org/r/295374 (owner: 10Ori.livneh)
[21:58:48] <wikibugs>	 07Blocked-on-Operations, 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: adywiki and jamwiki are missing the associated *_p databases with appropriate views - https://phabricator.wikimedia.org/T135029#2397893 (10Gehel) Summary of a discussion with @ori:  The maintain-replicas script creates a new sch...
[22:00:08] <grrrit-wm>	 (03PS2) 10Smalyshev: Prepare scap3 deployment for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/295437 (https://phabricator.wikimedia.org/T129144) 
[22:00:55] <wikibugs>	 06Operations: setup syslog server in codfw - https://phabricator.wikimedia.org/T138073#2397902 (10RobH)
[22:00:57] <wikibugs>	 06Operations, 10ops-codfw, 10hardware-requests: procure syslog hardware in codfw - https://phabricator.wikimedia.org/T138075#2397898 (10RobH) 05Open>03stalled I'm stalling this task for #procurement T138353.  I'll gather pricing info on that task, and present the various options for review.
[22:01:18] <grrrit-wm>	 (03CR) 10jenkins-bot: [V: 04-1] Prepare scap3 deployment for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/295437 (https://phabricator.wikimedia.org/T129144) (owner: 10Smalyshev)
[22:06:56] <grrrit-wm>	 (03PS3) 10Smalyshev: Prepare scap3 deployment for WDQS [puppet] - 10https://gerrit.wikimedia.org/r/295437 (https://phabricator.wikimedia.org/T129144) 
[22:08:31] <logmsgbot>	 !log ori@tin Synchronized static/images/mobile: I8f09e825: Optimize mobile static images (duration: 00m 34s)
[22:08:34] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:09:22] <icinga-wm>	 RECOVERY - Unmerged changes on repository mediawiki_config on mira is OK: No changes to merge.
[22:11:23] <wikibugs>	 06Operations, 06Discovery, 10Elasticsearch, 03Discovery-Search-Sprint, 13Patch-For-Review: Install and configure new elasticsearch servers in eqiad - https://phabricator.wikimedia.org/T138329#2397914 (10EBernhardson)
[22:14:15] <wikibugs>	 06Operations, 06Discovery, 10Elasticsearch, 10Wikimedia-Logstash, 03Discovery-Search-Sprint: Logstash elasticsearch mapping does not allow err.code to be a string - https://phabricator.wikimedia.org/T137400#2397921 (10EBernhardson)
[22:14:16] <grrrit-wm>	 (03CR) 10BBlack: [C: 031] tlsproxy: enable client/server TFO support in the kernel [puppet] - 10https://gerrit.wikimedia.org/r/295331 (https://phabricator.wikimedia.org/T108827) (owner: 10Ema)
[22:19:44] <bd808>	 !log Backfilled missing 2016-06-20 data to https://tools.wmflabs.org/sal/production?d=2016-06-20
[22:19:48] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[22:23:02] <icinga-wm>	 PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 648 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5149571 keys - replication_delay is 648
[22:34:02] <icinga-wm>	 RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5147613 keys - replication_delay is 0
[22:37:56] <wikibugs>	 07Blocked-on-Operations, 06Operations, 10DBA, 06Labs, 10Labs-Infrastructure: adywiki and jamwiki are missing the associated *_p databases with appropriate views - https://phabricator.wikimedia.org/T135029#2397950 (10Krenair) I agree, although it needs to be a separate ticket and I don't think we can just...
[23:00:04] <jouncebot>	 RoanKattouw, ostriches, Krenair, MaxSem, and Dereckson: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20160621T2300).
[23:00:52] <Dereckson>	 Hello.
[23:01:03] <Dereckson>	 There is nothing to SWAT this night.
[23:01:20] <Dereckson>	 evening
[23:01:23] <yurik>	 Dereckson, i would like to swat my own services
[23:01:26] <grrrit-wm>	 (03CR) 10Thcipriani: "Looks good, also need to remove wdqs/wdqs from hieradata/common/role/deployment.yaml and that should be it!" [puppet] - 10https://gerrit.wikimedia.org/r/295437 (https://phabricator.wikimedia.org/T129144) (owner: 10Smalyshev)
[23:01:28] <yurik>	 i had some trouble today
[23:01:55] <yurik>	 building services, because it turns out that the wonderful russian firewall is blocking the AWS !
[23:02:05] <yurik>	 and i couldn't build the depl packages :(
[23:02:12] <Dereckson>	 Annoying.
[23:02:14] <Krenair>	 You can probably request VPN access
[23:02:15] * yurik is not frustrated...
[23:02:22] <yurik>	 5+ hours wasted
[23:02:36] <yurik>	 Krenair, i finally did
[23:02:48] <yurik>	 its figuring out that i am being blocked that took some time!
[23:03:07] <yurik>	 because a minor script deep inside the build system by a 3rd party was failing :(
[23:03:21] <yurik>	 and it was falling back onto the local build, which was also now working
[23:03:35] <yurik>	 bleh, anyway, if nooone is deploying, i wll depl kartotherian & tilerator
[23:03:46] <yurik>	 unless there are some objections
[23:05:51] <tgr>	 !log deleted localuser rows for Mahir256@orwikisource and A879071@enwiki for T119736
[23:05:52] <stashbot>	 T119736: Could not find local user data for {Username}@{wiki} - https://phabricator.wikimedia.org/T119736
[23:05:55] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:06:11] <grrrit-wm>	 (03PS1) 10EBernhardson: Duplicate logstash output to alternate elasticsearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/295442 
[23:06:21] <Krenair>	 Well I'm certainly not objecting. I'm not really familiar with those services anyway although I think the second one is maps-related
[23:07:36] <yurik>	 Krenair, they both are ;)
[23:07:52] * yurik loves when users are being deleted from db by hand
[23:08:21] <yurik>	 especially because the wonderful sql's DELETE Blah means delete everything
[23:08:42] <Krenair>	 yeah, well... I'm sure the data makes more sense after it's done than before
[23:09:20] <Krenair>	 and yeah, those DELETE .. WHERE clauses are not something you'd want to screw up on production master DB servers, that's for sure :)
[23:11:00] <Dereckson>	 Hi tgr. That remembers me I've a funny other issue: a sessionfailure message when I tried to mark patrolled an hidden Flow diff: ?title=Topic:...&action=markpatrolled&rcid=... Do you think that's for AuthManager or only for Flow?
[23:12:31] <grrrit-wm>	 (03CR) 10BryanDavis: [C: 031] "LGTM. Should test via cherry-pick on deployment-puppetmaster" [puppet] - 10https://gerrit.wikimedia.org/r/295442 (owner: 10EBernhardson)
[23:13:43] * yurik loves the new scap3!
[23:13:53] <grrrit-wm>	 (03PS1) 10BBlack: cache_upload: experiment with higher fe hfp cutoff [puppet] - 10https://gerrit.wikimedia.org/r/295443 
[23:14:15] <yurik>	 !log updated/restarted kartotherian & tilerator - https://gerrit.wikimedia.org/r/#/c/295440/ https://gerrit.wikimedia.org/r/#/c/295441/
[23:14:18] <morebots>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[23:14:39] <grrrit-wm>	 (03CR) 10BBlack: [C: 032 V: 032] cache_upload: experiment with higher fe hfp cutoff [puppet] - 10https://gerrit.wikimedia.org/r/295443 (owner: 10BBlack)
[23:28:00] <grrrit-wm>	 (03PS2) 10EBernhardson: Duplicate logstash output to alternate elasticsearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/295442 
[23:33:16] <wikibugs>	 06Operations, 06Community-Liaisons, 10Wikimedia-Mailing-lists: mailman maint window 2016-06-21 16:00 - 18:00 UTC - https://phabricator.wikimedia.org/T138228#2398047 (10RobH) I neglected to update this task yesterday.  So the maint window is delayed until AFTER wikimania.  There is an ongoing discussion with...
[23:35:40] <grrrit-wm>	 (03PS3) 10EBernhardson: Duplicate logstash output to alternate elasticsearch cluster [puppet] - 10https://gerrit.wikimedia.org/r/295442