[00:03:00] Krenair, works [00:03:03] thanks you [00:04:59] (03PS1) 10Dzahn: mailman: redirects for search lists -> discovery [puppet] - 10https://gerrit.wikimedia.org/r/238650 (https://phabricator.wikimedia.org/T110256) [00:07:42] oh it finished [00:08:34] (03PS1) 10Dzahn: mailman: exim alias for discovery list renames [puppet] - 10https://gerrit.wikimedia.org/r/238652 [00:08:57] !log krenair@tin Synchronized php-1.26wmf23/extensions/VisualEditor/modules/ve-mw/ui/styles/dialogs: https://gerrit.wikimedia.org/r/#/c/238646/ (duration: 00m 12s) [00:09:02] James_F, ^ [00:09:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:09:07] Krenair: Thanks! [00:09:23] Krenair: And confirmed. [00:09:24] twentyafterfour, ^ [00:10:14] Krenair: so I can go now? :P [00:10:27] I think twentyafterfour was there first? [00:10:34] oops [00:10:38] legoktm: go ahead [00:10:39] * legoktm gets in line [00:10:48] you sure? [00:10:52] sure [00:11:03] thanks :) [00:11:15] legoktm: meanwhile, would love a review of https://gerrit.wikimedia.org/r/#/c/238647/ [00:11:19] :) [00:11:27] Oh, I'll wait until after :) [00:11:39] !log reinstalling lvs400[12] to jessie (traffic on 400[34], already jessie) [00:11:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:14:05] !log legoktm@tin Synchronized php-1.26wmf22/extensions/CentralAuth/includes/: Use set() for tokens with unique keys (duration: 00m 12s) [00:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:15:01] !log legoktm@tin Synchronized php-1.26wmf23/extensions/CentralAuth/includes/: Use set() for tokens with unique keys (duration: 00m 12s) [00:15:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:16:41] twentyafterfour: all done [00:21:52] Krinkle: looking now [00:22:33] 6operations, 7Mail: Remove Alias for sj@wm.o - https://phabricator.wikimedia.org/T108276#1643631 (10Dzahn) Samuel is 
still subscribed to various mailing lists as sj@wikimedia.org. sj@ is an alias for sklein@. If we just delete that, those list admins are going to get errors. sj@wikimedia.org found in: wm... [00:23:57] 6operations, 7Mail: Remove Alias for sj@wm.o - https://phabricator.wikimedia.org/T108276#1643632 (10Dzahn) 5Open>3stalled [00:24:22] legoktm: Aye, looks like that patch may actually also save VE. Dang. I didn't foresee that [00:24:24] it waits for "user" [00:24:27] which my patch fixes [00:24:59] thx [00:26:25] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Mailman Upgrade (Jessie & Mailman 2.x) and migration to a VM - https://phabricator.wikimedia.org/T105756#1643635 (10Dzahn) [00:26:26] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: fermium needs to have exim4-daemon-heavy installed, not -light - https://phabricator.wikimedia.org/T112229#1643633 (10Dzahn) 5Open>3Resolved This was fixed after https://gerrit.wikimedia.org/r/#/c/237658/ [00:27:26] twentyafterfour: I'm hotfixing a regression in wmf22 now. User scripts broke globally as a side effect of a config change earlier [00:27:44] sorry for the intrusion. I can back off depending on what you're doing. [00:28:25] 6operations, 10Wikimedia-Mailing-lists: wikimediabe-l: decide status of list - https://phabricator.wikimedia.org/T110974#1643645 (10Dzahn) what did you want to achieve here, @JohnLewis? i'd say the status is "still active", so nothing to do? 
[00:28:34] 6operations, 10Wikimedia-Mailing-lists: wikimediabe-l: decide status of list - https://phabricator.wikimedia.org/T110974#1643646 (10Dzahn) p:5Normal>3Low [00:28:49] 6operations, 10Wikimedia-Mailing-lists: wikimediabe-l: decide status of list - https://phabricator.wikimedia.org/T110974#1643649 (10Dzahn) a:3JohnLewis [00:29:19] !log krinkle@tin Synchronized php-1.26wmf22/resources/src/mediawiki/mediawiki.js: hotfix Ia2fcd13f4 (duration: 00m 11s) [00:29:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [00:29:34] 6operations, 10Wikimedia-Mailing-lists, 6Wiktionary: wiktionary-l: assign new moderators - https://phabricator.wikimedia.org/T110969#1643657 (10Dzahn) needs "Thehelpfulone" [00:32:26] 6operations, 7Graphite, 7Monitoring: Restrict edit rights in grafana / enable dashboard deletion - https://phabricator.wikimedia.org/T93710#1643663 (10Dzahn) 5Open>3Resolved a:3Dzahn Claiming it's resolved per "" in the second linked patch. @gwicke ok? [00:33:19] 10Ops-Access-Requests, 6operations, 10Continuous-Integration-Infrastructure, 5Patch-For-Review: Let contint-admins force run puppet with /usr/local/sbin/puppet-run - https://phabricator.wikimedia.org/T110943#1643666 (10Dzahn) @hashar yes, i will do that [00:39:00] 10Ops-Access-Requests, 6operations, 3Discovery-Wikidata-Query-Service-Sprint, 7Icinga, 5Patch-For-Review: Get smalyshev permissions to icinga enough to control monitoring for wdqs_eqiad group - https://phabricator.wikimedia.org/T111243#1643682 (10Dzahn) This needs https://gerrit.wikimedia.org/r/#/c/23506... [00:40:04] 6operations, 7Mail: Remove Alias for sj@wm.o - https://phabricator.wikimedia.org/T108276#1643685 (10Dzahn) @JGulingan is sklein@wikimedia.org existing in Google and an autoresponder? 
[00:40:32] 6operations, 7Mail: Remove Alias for sj@wm.o - https://phabricator.wikimedia.org/T108276#1643686 (10Dzahn) p:5Triage>3Normal [00:41:08] 6operations, 6Performance-Team: Define SLAs for media - https://phabricator.wikimedia.org/T112692#1643687 (10Dzahn) p:5Triage>3Normal [00:41:28] 6operations: Upgrade phpredis client on zend - https://phabricator.wikimedia.org/T112694#1643689 (10Dzahn) p:5Triage>3Normal [00:41:29] Done [00:42:27] Krinkle: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#Broken_User_Scripts some users noticed ;) [00:43:26] 6operations, 7Mail: Remove Alias for sj@wm.o - https://phabricator.wikimedia.org/T108276#1643691 (10Dzahn) a:3JGulingan [00:43:39] legoktm: I expect nothing less [00:46:26] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2014_v6, cp4014_v6 [00:48:19] 6operations, 7Mail: Remove Alias for sj@wm.o - https://phabricator.wikimedia.org/T108276#1643698 (10Dzahn) @tbayer as admin of "wmfcc-l", can you decide what to do with Samuel's subscription? Either replace with his private gmail address or remove? I'm not sure how what is appropriate on this list but we are b... [00:49:46] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK [00:50:57] 6operations, 7Mail: Remove Alias for sj@wm.o - https://phabricator.wikimedia.org/T108276#1643703 (10Krenair) >>! In T108276#1643698, @Dzahn wrote: > chaptercommittee-l is a redirect to "AffCom". not sure here either. 
@Varnent is one of the AffCom list admins [00:58:07] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 57 not-conn: cp2005_v6, cp3039_v6, cp4006_v6 [00:58:43] 6operations, 10Traffic, 5Patch-For-Review: Upgrade codfw,ulsfo,esams LVS to jessie - https://phabricator.wikimedia.org/T96375#1643713 (10BBlack) [00:59:46] 6operations, 10Traffic, 5Patch-For-Review: Upgrade codfw,ulsfo,esams LVS to jessie - https://phabricator.wikimedia.org/T96375#1215460 (10BBlack) ulsfo: tested backups on jessie most of today, and converted primaries this evening (whole site now on jessie LVS) esams: will likely do the same tomorrow [01:01:32] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK [01:03:14] !log krinkle@tin Synchronized php-1.26wmf23/resources/src/mediawiki/mediawiki.js: hotfix Ia2fcd13f4 (duration: 00m 12s) [01:03:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [01:05:16] 6operations, 7Graphite, 7Monitoring: Restrict edit rights in grafana / enable dashboard deletion - https://phabricator.wikimedia.org/T93710#1643720 (10GWicke) @dzahn, yes indeed. Thanks, @ori! 
[01:05:50] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 59 connecting: (unnamed) not-conn: cp2014_v6 [01:06:11] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 57 not-conn: cp3033_v6, cp3037_v6, cp4006_v6 [01:07:30] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK [01:07:51] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [01:08:44] (03PS1) 10Tim Landscheidt: Tools: Source python-socketio-client for Trusty from backports [puppet] - 10https://gerrit.wikimedia.org/r/238662 (https://phabricator.wikimedia.org/T91874) [01:13:01] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 57 connecting: (unnamed) not-conn: cp2011_v6, cp3047_v6, cp4015_v6 [01:14:42] Krenair: legoktm Krinkle (and others) I'm going to kill grrrit-wm for a bit [01:14:48] k [01:14:49] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [01:15:51] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp4014_v6 [01:17:30] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK [01:19:09] sendmmsg(10, {{{msg_name(0)=NULL, msg_iov(1)=[{"WS\1\0\0\1\0\0\0\0\0\0\3irc\10freenode\3net\0\0\1\0\1", 34}], msg_controllen=0, msg_flags=0}, 34}, {{msg_name(0)=NULL, msg_iov(1)=[{"\336V\1\0\0\1\0\0\0\0\0\0\3irc\10freenode\3net\0\0\34\0\1", 34}], msg_controllen=0, msg_flags=0}, 34}}, 2, MSG_NOSIGNAL) = 2 [01:19:11] it's stuck doing that [01:19:13] over and over again [01:19:31] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2008_v6, cp2017_v6 [01:21:21] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp4013_v6 [01:21:25] interesting [01:21:29] I can't hit the internet from my containers [01:22:30] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3043_v6 [01:22:50] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [01:23:00] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [01:25:51] RECOVERY - IPsec 
on cp1072 is OK: Strongswan OK - 60 ESP OK [01:28:09] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 56 not-conn: cp2002_v6, cp3034_v6, cp3046_v6, cp4006_v6 [01:30:24] 6operations, 6Labs: Investigate whether to use Debian's jessie-backports - https://phabricator.wikimedia.org/T107507#1643731 (10scfc) (I submitted a change for #Tool-Labs to enable backports for specific packages only at https://gerrit.wikimedia.org/r/#/c/238662/.) [01:31:39] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [01:32:49] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp4015_v6 [01:34:49] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3043_v6 [01:36:01] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK [01:36:29] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [01:38:59] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [01:40:40] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10693 bytes in 0.114 second response time [01:44:20] PROBLEM - IPsec on cp1046 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp3016_v6 [01:45:00] (03PS1) 10Tim Landscheidt: Tools: Move configuration from wikitech's Hiera to hieradata/ [puppet] - 10https://gerrit.wikimedia.org/r/238663 [01:47:40] RECOVERY - IPsec on cp1046 is OK: Strongswan OK - 24 ESP OK [01:48:39] PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [01:49:42] grrrit-wm uses sendmmsg()? I'm kinda surprised, I wouldn't expect many high-level abstractions for that in common use yet, it's only been around a few years, and totally doesn't matter unless you're really trying to optimize hard. 
[01:51:01] in any case, your sendmmsg() data is a DNS request for irc.freenode.net [01:52:37] bblack: yeah, so there's no internet connectivity from inside the docker container to the wide world [01:52:45] bblack: this is also running node 4 which was released a few weeks ago [01:52:50] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 59 connecting: (unnamed) not-conn: cp2020_v6 [01:54:37] bblack: thanks! I didn't realize it was a DNS request... [01:54:40] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 58 connecting: (unnamed) not-conn: cp2014_v6, cp4015_v6 [01:55:11] RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 301 TLS Redirect - 505 bytes in 1.008 second response time [01:56:11] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK [01:56:40] FYI, I recognized it just because of the data format. when hostnames are encoded in the DNS protocol, each label (the things between the dots) is separately encoded with a length byte in front of it [01:57:09] so \3irc\10freenode\3net, assuming the \numbers are octal, is 3bytes "irc" 8bytes "freenode" 3bytes "net" [01:57:31] nice! Didn't realize that [01:57:46] anything else but dns probably wouldn't look like that, it would have some encoding of a 15-16 byte string with dots in it [01:57:56] I haven't looked at the DNS protocol at all, but I bet you have many many many times :D [01:58:01] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK [01:58:08] too many times! 
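Editor's note: the label encoding described above can be checked against the sendmmsg() payload quoted earlier in the log. The following is a minimal Python sketch, not anything grrrit-wm itself runs; the function name `decode_dns_question` is made up for illustration, and it assumes an uncompressed query with exactly one question, which is what the logged payload looks like.

```python
import struct


def decode_dns_question(packet: bytes):
    """Return (labels, qtype, qclass) for the first question in a DNS query.

    Assumes no name compression, which holds for a plain outgoing query.
    """
    offset = 12  # fixed DNS header: ID, flags, and four 16-bit counts
    labels = []
    while True:
        length = packet[offset]  # each label is prefixed by one length byte
        offset += 1
        if length == 0:  # a zero-length label ends the name
            break
        labels.append(packet[offset:offset + length].decode("ascii"))
        offset += length
    # The name is followed by two big-endian 16-bit fields: QTYPE and QCLASS.
    qtype, qclass = struct.unpack_from("!HH", packet, offset)
    return labels, qtype, qclass


# First payload from the log. Python bytes literals accept the same octal
# escapes that strace prints, so the string is copied as-is.
query = b"WS\1\0\0\1\0\0\0\0\0\0\3irc\10freenode\3net\0\0\1\0\1"
labels, qtype, qclass = decode_dns_question(query)
print(".".join(labels), qtype, qclass)  # prints: irc.freenode.net 1 1
```

QTYPE 1 is an A record lookup; the second payload in the log ends in \0\34 (octal 34 = 28), which is the matching AAAA lookup for the same name.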
[01:58:25] heh [01:58:26] nice [02:06:21] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp3048_v6, cp4015_v6 [02:08:01] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK [02:09:12] the last two weeks have been like a dive into the deep end of networking [02:09:17] relatively speaking at least [02:13:25] (03PS1) 10Yuvipanda: k8s: Turn on ip masquerading for kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/238666 [02:13:49] (03PS2) 10Yuvipanda: k8s: Turn on ip masquerading for kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/238666 [02:13:59] (03CR) 10Yuvipanda: [C: 032 V: 032] k8s: Turn on ip masquerading for kube-proxy [puppet] - 10https://gerrit.wikimedia.org/r/238666 (owner: 10Yuvipanda) [02:16:09] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out [02:18:11] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 59 connecting: (unnamed) not-conn: cp2008_v6 [02:19:29] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 10693 bytes in 0.103 second response time [02:25:10] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3047_v6 [02:27:40] PROBLEM - IPsec on cp1060 is CRITICAL: Strongswan CRITICAL - ok: 23 not-conn: cp2015_v6 [02:28:29] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [02:29:19] RECOVERY - IPsec on cp1060 is OK: Strongswan OK - 24 ESP OK [02:30:19] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp4005_v6 [02:31:40] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK [02:32:00] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [02:33:10] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 58 not-conn: cp2017_v6, cp3042_v6 [02:34:37] mehrrit-wm1: hi! 
[02:34:41] are you from docker [02:34:54] YES YOU ARE [02:34:57] HOW BEATUFIUL [02:35:00] !log l10nupdate@tin Synchronized php-1.26wmf22/cache/l10n: l10nupdate for 1.26wmf22 (duration: 07m 02s) [02:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:38:10] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK [02:38:49] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf22) at 2015-09-16 02:38:48+00:00 [02:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [02:40:20] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3043_v6 [02:42:00] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 60 ESP OK [02:42:15] (03PS2) 10Yuvipanda: Tools: Move configuration from wikitech's Hiera to hieradata/ [puppet] - 10https://gerrit.wikimedia.org/r/238663 (owner: 10Tim Landscheidt) [02:42:21] mmmmmmm [02:44:56] 6operations, 7Mail: Remove Alias for sj@wm.o - https://phabricator.wikimedia.org/T108276#1643774 (10Tbayer) >>! In T108276#1643698, @Dzahn wrote: > @tbayer as admin of "wmfcc-l", can you decide what to do with Samuel's subscription? Either replace with his private gmail address or remove? I'm not sure how what... 
[02:45:01] PROBLEM - IPsec on cp1071 is CRITICAL: Strongswan CRITICAL - ok: 59 connecting: (unnamed) not-conn: cp3036_v6 [02:46:40] RECOVERY - IPsec on cp1071 is OK: Strongswan OK - 60 ESP OK [02:50:11] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp3042_v6 [02:51:50] RECOVERY - IPsec on cp1073 is OK: Strongswan OK - 60 ESP OK [02:59:52] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 59 not-conn: cp4015_v6 [03:01:39] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 60 ESP OK [03:02:22] RECOVERY - BGP status on cr1-ulsfo is OK: OK: host 198.35.26.192, sessions up: 12, down: 0, shutdown: 0 [03:03:06] hi krrrit-wm [03:04:34] (03CR) 10Yuvipanda: [C: 032] Tools: Move configuration from wikitech's Hiera to hieradata/ [puppet] - 10https://gerrit.wikimedia.org/r/238663 (owner: 10Tim Landscheidt) [03:05:20] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-others/snapshot is not accessible: Permission denied [03:05:34] !log l10nupdate@tin Synchronized php-1.26wmf23/cache/l10n: l10nupdate for 1.26wmf23 (duration: 10m 30s) [03:05:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:06:17] Krenair: legoktm Krinkle and others it's back! and is running in kubernetes now [03:06:19] I'll update the docs [03:06:41] kuberwhat? [03:06:56] Is that on a labs host? [03:07:00] kubernetes.io I presume [03:07:03] Krinkle: kubernetes.io, in toollabs. [03:07:16] it's the alternative to gridengine I've been setting up [03:07:22] I'll send out an announcement today. [03:07:25] needed to test it [03:07:33] I'll update docs now [03:07:48] Krinkle: it's running in a docker container with node 4 now [03:07:50] How many nodes? [03:07:55] 2 nodes atm [03:08:01] I'm adding a couple more now [03:08:10] but access is 'opt-in' - and only lolrrit-wm has access now [03:08:35] no NFS access yet [03:08:41] YuviPanda: node 4? [03:09:10] hmm? 
[03:09:12] oh [03:09:15] Krinkle: I meant nodejs 4 [03:09:19] 4.0 [03:09:26] oh, lolrrit is a node program? [03:09:33] I didn't know that [03:10:25] Krinkle: yes [03:11:02] It uses yaml for config, and you maintain it, so I assume python [03:11:22] I've only ever seen the yaml file :D [03:11:52] YuviPanda, you should tell Krinkle about the time I confused JS for python because I saw some other file and assumed everything else was python. ;) [03:12:08] !log l10nupdate@tin LocalisationUpdate completed (1.26wmf23) at 2015-09-16 03:12:08+00:00 [03:12:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:12:28] Krinkle: yup, subbu wrote a patch to the js file *in python* and then it took me pointing it out to him twice :D [03:12:58] Krinkle: it deals with two otherwise blocking things (ssh for gerrit and irc) and so for these async type things I prefer js [03:13:07] I'm also surprised it just worked on nodejs [03:13:49] is it a better story than the OCG outage because someone kept string concatenation with + in PHP when porting the code from javascript? [03:14:13] the OCG one is better purely because it caused an actual outage [03:14:30] https://gerrit.wikimedia.org/r/#/c/222615/ [03:14:58] "Javascript is eating the world" [03:15:35] now I've unmounted all NFS from the worker nodes [03:15:36] \o/ [03:15:46] so nodes can have labels, so users can choose between having NFS or not [03:15:53] for their containers [03:20:19] PROBLEM - Unmerged changes on repository puppet on palladium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). [03:20:40] PROBLEM - Unmerged changes on repository puppet on strontium is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet). 
[03:20:44] YuviPanda: ^ [03:21:00] aaaah sorry got distracted by the shinnnny [03:21:27] paravoid: so labs puppetmaster isn't affected by ^ since it just has a cron so I am missing these more often now [03:21:50] RECOVERY - Unmerged changes on repository puppet on palladium is OK: No changes to merge. [03:22:20] RECOVERY - Unmerged changes on repository puppet on strontium is OK: No changes to merge. [03:27:46] !log ori@tin Synchronized php-1.26wmf23/vendor/monolog/monolog/src/Monolog/Logger.php: Iccfda47689: monolog: Dont waste milliseconds counting microseconds ; sync-file php-1.26wmf22/vendor/monolog/monolog/src/Monolog/Logger.php Iccfda47689: monolog: Dont waste milliseconds counting microseconds (duration: 00m 12s) [03:27:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:28:09] stupid quotes [03:28:31] RECOVERY - Disk space on labstore1002 is OK: DISK OK [03:28:39] !log ori@tin Synchronized php-1.26wmf22/vendor/monolog/monolog/src/Monolog/Logger.php: Iccfda47689: monolog: Don't waste milliseconds counting microseconds (duration: 00m 12s) [03:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [03:47:05] 6operations, 7Mail: Remove Alias for sj@wm.o - https://phabricator.wikimedia.org/T108276#1643793 (10Sj) Hello all, catching up. This thread was being downfiltered in my mail client (speaking of email quirks!) Yea, sj@ has been in use for a while; thanks for tracking down all the crufty places it is used. I... 
[03:49:59] PROBLEM - puppet last run on mw2131 is CRITICAL: CRITICAL: puppet fail [03:50:30] (03PS2) 10Yuvipanda: toollabs: add python-psycopg2 [puppet] - 10https://gerrit.wikimedia.org/r/238450 (owner: 10Merlijn van Deen) [03:51:22] (03CR) 10Yuvipanda: [C: 032 V: 032] "For andrewbogott" [puppet] - 10https://gerrit.wikimedia.org/r/238450 (owner: 10Merlijn van Deen) [04:03:40] PROBLEM - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-maps/snapshot is not accessible: Permission denied [04:09:21] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=1.90 Read Requests/Sec=885.84 Write Requests/Sec=19.38 KBytes Read/Sec=20987.95 KBytes_Written/Sec=78.31 [04:11:00] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=1.30 Read Requests/Sec=0.00 Write Requests/Sec=0.40 KBytes Read/Sec=0.00 KBytes_Written/Sec=2.01 [04:18:20] RECOVERY - puppet last run on mw2131 is OK: OK: Puppet is currently enabled, last run 5 seconds ago with 0 failures [05:08:19] (03PS1) 10Alex Monk: Move *.labsdb aliases into DNS [puppet] - 10https://gerrit.wikimedia.org/r/238672 (https://phabricator.wikimedia.org/T63897) [05:53:10] It looks like localisation messages are not updated in testwiki, https://test.wikipedia.org/wiki/Special:ContentTranslation?debug=1 [05:53:34] twentyafterfour: did you see anything yesterday? [05:53:37] ^^ [06:11:03] (03CR) 10Giuseppe Lavagetto: Add instrumentation (034 comments) [debs/pybal] - 10https://gerrit.wikimedia.org/r/238152 (https://phabricator.wikimedia.org/T102394) (owner: 10Giuseppe Lavagetto) [06:15:51] OK, messages are okay now. 
[06:30:20] PROBLEM - puppet last run on mc2015 is CRITICAL: CRITICAL: puppet fail [06:30:56] (03CR) 10EBernhardson: [C: 031] TTMServer: enable wikimedia extra plugin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238446 (owner: 10DCausse) [06:31:20] PROBLEM - puppet last run on cp2013 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:20] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Puppet has 1 failures [06:31:30] PROBLEM - puppet last run on mw2081 is CRITICAL: CRITICAL: Puppet has 2 failures [06:31:58] !log l10nupdate@tin ResourceLoader cache refresh completed at Wed Sep 16 06:31:58 UTC 2015 (duration 31m 57s) [06:32:00] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [06:32:41] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:50] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: Puppet has 1 failures [06:32:51] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:30] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures [06:33:50] PROBLEM - puppet last run on mw2120 is CRITICAL: CRITICAL: Puppet has 1 failures [06:34:07] (03CR) 10EBernhardson: "wrong ticket linked above, should have been T112598" [puppet] - 10https://gerrit.wikimedia.org/r/224651 (owner: 10Manybubbles) [06:56:19] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [06:56:20] RECOVERY - puppet last run on cp2013 is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures [06:56:29] RECOVERY - puppet last run on mw2081 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [06:56:50] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [06:56:51] RECOVERY - puppet 
last run on mw1061 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:00] RECOVERY - puppet last run on mc2015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:40] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [06:57:50] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [06:57:50] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures [06:58:50] RECOVERY - puppet last run on mw2120 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures [07:13:29] jynus: morning. [07:13:42] (03PS3) 10Muehlenhoff: Exclude DNS requests from connection tracking [puppet] - 10https://gerrit.wikimedia.org/r/238447 (https://phabricator.wikimedia.org/T104968) [07:14:02] morning, kart_ [07:14:13] jynus: I have a question about truncating a table. We have filled data in wikishared/cx_suggestions [07:14:34] and now want to truncate it and refill new data. [07:14:38] filled up? [07:14:56] or just "you not need it any more"? [07:15:05] *do [07:15:15] not need. new data need to be added. [07:15:24] so you want to truncate it? [07:15:29] yes. [07:15:37] What's the best way to do it? [07:15:55] how many rows? [07:16:11] and how busy is it? [07:16:16] (I can check) [07:16:24] I am just following my train of thought [07:17:05] if rows are manageable and selects are only few and far between, TRUNCATE TABLE will be fast and secure [07:17:41] if many rows or table is busy (SELECT all the time), incremental DELETEs would be the way to go [07:17:54] [count(*)] => 6409 [07:18:11] right now table is used in testwiki, so not busy. [07:18:23] select count(*) from cx_suggestions; [07:18:27] ie above. 
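Editor's note: the "incremental DELETEs" option mentioned above can be sketched as follows. This is only an illustration under stated assumptions: Python's built-in sqlite3 stands in for the production MariaDB cluster, the helper name `delete_in_batches` and the single `title` column are hypothetical, and only the table name and the row count of 6409 come from the conversation. On MySQL or MariaDB the same idea is usually written as a repeated `DELETE ... LIMIT n`.

```python
import sqlite3


def delete_in_batches(conn, table, batch_size=1000):
    """Empty `table` with repeated small DELETEs; return total rows removed.

    `table` is interpolated into the SQL, so pass only trusted names.
    """
    total = 0
    while True:
        cur = conn.execute(
            f"DELETE FROM {table} WHERE rowid IN "
            f"(SELECT rowid FROM {table} LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # commit per batch so each transaction stays short
        if cur.rowcount == 0:  # nothing left to delete
            break
        total += cur.rowcount
    return total


# Stand-in data: an in-memory table with the row count quoted in the log.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cx_suggestions (title TEXT)")
conn.executemany(
    "INSERT INTO cx_suggestions VALUES (?)",
    [(f"page {i}",) for i in range(6409)],
)
conn.commit()

removed = delete_in_batches(conn, "cx_suggestions", batch_size=1000)
remaining = conn.execute("SELECT count(*) FROM cx_suggestions").fetchone()[0]
print(removed, remaining)  # prints: 6409 0
```

Small batches keep each transaction short, which is the point of the trade-off described in the chat: slower overall than TRUNCATE TABLE, but gentler on a table that is being read constantly.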
[07:19:17] going for that, truncate on the master works, should take less than 1 second, no lag [07:19:37] jynus: how to make sure I'm doing operation on master? [07:19:41] do you want me to do it? [07:19:47] if yes, send me a ticket [07:20:05] (you know, to verify identity and so on before deleting data :-)) [07:20:12] jynus: that would be nice (also document it :)) [07:20:17] Agree. [07:20:19] exactly [07:20:49] is there a chance that old data would be useful? [07:20:56] I like to perform a backup every time destructive operations run [07:21:32] jynus: not useful. [07:21:51] I mean data isn't useful here. [07:22:57] :-) [07:24:39] https://phabricator.wikimedia.org/T112732 [07:24:43] jynus: ^ [07:24:57] Add relevant tags/project. [07:25:34] (03CR) 10Faidon Liambotis: [C: 04-1] "I'm still confused :)" [puppet] - 10https://gerrit.wikimedia.org/r/238447 (https://phabricator.wikimedia.org/T104968) (owner: 10Muehlenhoff) [07:26:09] is this something that may be done more in the future? [07:26:52] 6operations, 10MediaWiki-extensions-GWToolset, 6Multimedia, 7Performance: Undertake a mass upload of 14 million files (1.5 TB) to Commons - https://phabricator.wikimedia.org/T88758#1644053 (10fgiunchedi) >>! In T88758#1641827, @Reedy wrote: >>>! In T88758#1613381, @fgiunchedi wrote: >> afaik this isn't blo... [07:28:53] jynus: yes (in rare cases though) [07:29:42] if that is the case, a mwscript probably would be the best way, that checks what I told you before (size and traffic) [07:31:13] but to be fair, to run it automatically [07:31:30] an incremental delete will always be safer [07:31:48] no metadata locking, slower garbage collection, etc. [07:32:06] for that, there is a batchupdate script or similar [07:32:49] (03CR) 10Muehlenhoff: "Sure, addressing the root cause is even better, I'll gather some data to create a ticket for that." 
[puppet] - 10https://gerrit.wikimedia.org/r/238447 (https://phabricator.wikimedia.org/T104968) (owner: 10Muehlenhoff) [07:34:54] (03PS1) 10Muehlenhoff: Enable an initial image scaler with base::firewall in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/238682 [07:35:46] jynus: enlighten me about batchupdate, i.e. is it documented somewhere? :) [07:36:03] should be :-) [07:36:05] let me check [07:36:25] https://phabricator.wikimedia.org/diffusion/MW/browse/master/maintenance/runBatchedQuery.php [07:37:00] s/This is used on large wikis/this is used everywhere at the WMF/ [07:37:28] I will have to check if it works on the x1 cluster, though [07:38:09] I will put all suggestions on the ticket [07:39:49] Thanks! [07:43:04] (03CR) 10Zfilipin: "I am fine with either checking the submodules or not." [puppet] - 10https://gerrit.wikimedia.org/r/238471 (https://phabricator.wikimedia.org/T102020) (owner: 10JanZerebecki) [07:44:05] (03CR) 10Zfilipin: "Hashar, did you vote verified +1 by mistake, instead of code-review +1?" [puppet] - 10https://gerrit.wikimedia.org/r/238491 (owner: 10Hashar) [07:44:51] kart_, to avoid miscommunications: https://phabricator.wikimedia.org/T112732#1644086 [07:45:13] jynus: looks good to me. [07:45:27] on the ticket, please :-) [07:45:38] that way if it fails, I can blame you in writing :-P [07:45:57] done. [07:45:59] :) [07:46:19] sorry to be so pedantic [07:46:45] but I had issues with doing before or after it was required [07:47:38] also, unlike code deployment, there is no defined procedure for db migrations [07:47:50] jynus: Agree. [07:48:06] and that, I want to change it [07:48:18] jynus: for example, touching other cx_ tables is dangerous :) [07:48:23] kart_: what did you do to fix localisation? [07:48:45] twentyafterfour: I don't know. It fixed itself. [07:48:56] maybe I helped there? [07:49:05] twentyafterfour: maybe l10nupdate ran correctly. [07:49:28] weren't those the tables that were still in latin1? Or maybe I am confusing them? 
[07:49:36] hmm ok ... it should have had localization as part of the initial scap ... [07:49:55] oh, then ignore my comment [07:50:00] jynus: tables are binary :) [07:50:08] not all [07:50:25] I discovered some that were incorrectly configured (but not yours) [07:50:30] jynus: Another question: how frequent are our DB backups for wikishared? [07:50:50] (03CR) 10Zfilipin: WIP Move Ruby related packages to a separate file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/237876 (https://phabricator.wikimedia.org/T110865) (owner: 10Zfilipin) [07:51:08] kart_, well, in general backups run every week [07:51:23] plus binary logs allow point in time recovery [07:51:43] (03PS2) 10Zfilipin: WIP Move Ruby related packages to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/237876 (https://phabricator.wikimedia.org/T110865) [07:51:47] good to know! Thanks. [07:52:18] but I want to do a full audit of backups, last time I found serious problems [07:52:25] (that are already fixed) [07:52:41] but want to do a production-wide evaluation [07:54:41] in general, recovering data is not an issue, problem is time to recovery [07:55:03] (03PS3) 10Zfilipin: WIP Move Ruby related packages to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/237876 (https://phabricator.wikimedia.org/T110865) [07:56:33] kart_, https://phabricator.wikimedia.org/T112732#1644111 [07:56:41] !log depooled mw1153 (videoscaler) to enable ferm [07:56:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:57:03] !log depooled mw1153 (it's an image scaler, of course) to enable ferm [07:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:57:15] !log truncated some tables from ContentTranslation extension on x1 [07:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [07:58:51] (03CR) 10Faidon Liambotis: [C: 04-1] "Thanks for working on this. 
Responding to your points:" [puppet] - 10https://gerrit.wikimedia.org/r/237871 (owner: 10Tim Landscheidt) [08:00:00] (03CR) 10Zfilipin: "https://gerrit.wikimedia.org/r/#/c/220308/ is merged into master, I have rebased this patch, everything looks fine." [puppet] - 10https://gerrit.wikimedia.org/r/237876 (https://phabricator.wikimedia.org/T110865) (owner: 10Zfilipin) [08:00:25] Hi jynus ! do you have time for 1 more quick PM? :) [08:00:54] sure, have to pay some debts to the current clinic duty, too :-) [08:02:06] jynus: thanks. [08:06:45] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable an initial image scaler with base::firewall in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/238682 (owner: 10Muehlenhoff) [08:12:29] !log repooled mw1153 with ferm enabled [08:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [08:13:54] (03CR) 10Hashar: [C: 04-1] WIP Move Ruby related packages to a separate file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/237876 (https://phabricator.wikimedia.org/T110865) (owner: 10Zfilipin) [08:17:32] (03PS2) 10Alexandros Kosiaris: Ensure correct order in postgresql::user [puppet] - 10https://gerrit.wikimedia.org/r/237565 (https://phabricator.wikimedia.org/T112228) (owner: 10Gergő Tisza) [08:17:39] (03CR) 10Alexandros Kosiaris: [C: 032 V: 032] Ensure correct order in postgresql::user [puppet] - 10https://gerrit.wikimedia.org/r/237565 (https://phabricator.wikimedia.org/T112228) (owner: 10Gergő Tisza) [08:21:05] 6operations, 3Discovery-Maps-Sprint: Kartotherian git deploy service restart failed with perm error - https://phabricator.wikimedia.org/T112707#1644155 (10akosiaris) a:5akosiaris>3ArielGlenn This seems to be the same bug as T102039 as noted on https://phabricator.wikimedia.org/T102039#1389253. Assigning t... 
[08:21:13] 6operations, 3Discovery-Maps-Sprint: Kartotherian git deploy service restart failed with perm error - https://phabricator.wikimedia.org/T112707#1644160 (10akosiaris) p:5Triage>3Normal [08:21:30] apergos: new customer for git deploy failures https://phabricator.wikimedia.org/T112707 [08:22:06] thanks [08:36:16] PROBLEM - Restbase endpoints health on xenon is CRITICAL: /page/data-parsoid/{title} is CRITICAL: Test Get data-parsoid by title returned the unexpected status 500 (expecting: 200) [08:37:57] RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy [08:43:36] (03PS1) 10Aude: Enable arbitrary access on enwiki and s2wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238685 (https://phabricator.wikimedia.org/T100788) [08:49:15] (03CR) 10Aude: [C: 04-1] "not to merge until ~10:00 UTC" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238685 (https://phabricator.wikimedia.org/T100788) (owner: 10Aude) [08:57:09] 7Puppet, 6operations, 5Patch-For-Review: Need to run postgresql::user twice to set the password - https://phabricator.wikimedia.org/T112228#1644213 (10Tgr) 5Open>3Resolved [08:57:10] 7Blocked-on-Operations, 7Puppet, 6Reading-Infrastructure-Team, 10Sentry, and 2 others: Create basic puppet role for Sentry - https://phabricator.wikimedia.org/T84956#1644214 (10Tgr) [09:08:35] 6operations, 10ops-codfw: setup/install/deploy new HP restbase servers for codfw - https://phabricator.wikimedia.org/T112683#1644234 (10fgiunchedi) @papaul when configuring management access in bios please also make sure that "boot mode" is set to legacy bios, we've mistakenly ordered those with uefi by defaul... [09:23:24] anyone else encountered this https://phabricator.wikimedia.org/T112738 ? 
[09:25:25] Sp7|Away, there may be an issue with recentchanges on mediawiki [09:27:56] jynus: seems so, for other wikis it's working fine [09:28:15] I am seeing some errors, but cannot reproduce yet [09:28:20] just curious whether it's happening to me only or everyone [09:28:26] ok [09:28:46] 6operations, 7JavaScript: Instability on fr.wikiversity project - https://phabricator.wikimedia.org/T112069#1644284 (10Lionel_Scheepmans) [09:29:28] problem with mediawiki is that it has so little traffic compared to, let's say, enwiki, that it is easy to miss [09:29:41] thank you for the ping [09:29:51] will look at it now [09:30:55] Ouch [09:31:10] Hmm WFM though [09:31:19] Looking at exception.log on fluorine [09:31:54] I am seeing Call to a member function getTitle() on a non-object (NULL) [09:33:02] 2015-09-16 09:06:49 mw1077 mediawikiwiki exception ERROR: [4cfbdc54] /wiki/Special:RecentChanges BadMethodCallException from line 496 of /srv/mediawiki/php-1.26wmf23/includes/changes/EnhancedChangesList.php: Call to a member function getTitle() on a non-object (NULL) {"exception_id":"4cfbdc54"} [09:33:03] [Exception BadMethodCallException] (/srv/mediawiki/php-1.26wmf23/includes/changes/EnhancedChangesList.php:496) Call to a member function getTitle() on a non-object (NULL) [09:33:26] yes [09:33:57] can you follow up/escalate. My connection is so bad that I need to move [09:34:01] Yeah will do [09:34:20] will be back in a moment [09:39:26] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1644302 (10fgiunchedi) thanks @gwicke, no problem! I noticed `restbase-test2002` and `restbase-2003` are experiencing much higher gc times than `restbase-... [09:40:55] Oh great, it's a Flow RC row :S [09:40:58] (diff | hist) . . N Mediawiki sends wrong password on Project:Support desk; 07:59 . . (+676)‎ . . 
Vivideo (talk | contribs) [09:41:06] is the one that appears to make it barf [09:41:55] RoanKattouw: a deleted flow topic? [09:42:01] e.g. no title? [09:44:12] It's not deleted: https://www.mediawiki.org/wiki/Topic:Sp0mh9t59dye1tl7 [09:44:20] hmm [09:44:21] It only breaks in enhanced RC [09:44:34] enhanced always gets forgotten :( [09:47:06] Nothing appears to have changed in RC-related code in Flow, checking MW core now [09:47:36] Hmm, but, WTF: (diff | hist) . . N Fatal exception of type MWException on Project:Support desk; 08:27 . . (+530)‎ . . Dier2014 (talk | contribs) [09:47:43] That one doesn't make it fail [09:47:57] Maybe it's only when it's the second entry for the same page? [09:49:09] I think it might be https://gerrit.wikimedia.org/r/#/c/223557/ [09:53:16] PROBLEM - Restbase endpoints health on xenon is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:53:45] PROBLEM - Restbase endpoints health on cerium is CRITICAL: /page/html/{title} is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /page/data-parsoid/{title} is CRITICAL: Test Get data-parsoid by title returned the unexpected status 500 (expecting: 200): /page/title/{title} is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /page/revisio [09:53:56] that's me ^ [09:54:16] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: /page/html/{title} is CRITICAL: Test Get html by title from storage returned the unexpected status 500 (expecting: 200): /page/data-parsoid/{title} is CRITICAL: Test Get data-parsoid by title returned the unexpected status 500 (expecting: 200): /page/title/{title} is CRITICAL: Test Get rev by title from storage returned the unexpected status 500 (expecting: 200): /page/r [09:56:37] RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy [09:57:15] RECOVERY - Restbase endpoints health on cerium is OK: All endpoints are healthy [09:57:41] 
Sp7|Away: Thanks for the report, I found the change that broke it and I'm reverting it. Should be fixed shortly [09:57:46] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy [09:57:57] thanks RoanKattouw [10:00:04] aude: Respected human, time to deploy Wikidata (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150916T1000). Please do the needful. [10:00:46] \o/ [10:01:27] aude: I need to deploy the fix for that RC bug, but I probably need to wait for Jenkins longer than you do [10:01:44] So go ahead, and I'll go after you [10:05:22] (03CR) 10Aude: [C: 032] Enable arbitrary access on enwiki and s2wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238685 (https://phabricator.wikimedia.org/T100788) (owner: 10Aude) [10:05:27] ah [10:05:28] ok [10:05:28] (03Merged) 10jenkins-bot: Enable arbitrary access on enwiki and s2wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238685 (https://phabricator.wikimedia.org/T100788) (owner: 10Aude) [10:05:34] this is quick [10:08:14] Yeah mw-config has its own Jenkins queue and it runs really fast [10:08:59] !log aude@tin Synchronized arbitraryaccess.dblist: (no message) (duration: 00m 11s) [10:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:09:13] * aude checks [10:09:14] aude: You just unlocked the kittens :D [10:10:11] https://sv.wikipedia.org/wiki/Anv%C3%A4ndare:Aude [10:10:20] * aude looks for a place for kittens on my enwiki user page [10:11:05] :D [10:12:23] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1644354 (10fgiunchedi) enabled hyperthreading on 2002 and rebooted, will watch for gc times if that makes a difference another side effect of having 2001... [10:12:46] https://en.wikipedia.org/wiki/User:Aude [10:13:07] the css and layout isn't perfect but whatever... 
[10:13:08] :) [10:13:25] RoanKattouw: done [10:13:48] Cool [10:13:53] I'm still waiting for Jenkins [10:13:55] (round 2) [10:14:18] k [10:24:35] !log catrope@tin Synchronized php-1.26wmf23/includes/changes/EnhancedChangesList.php: T112738 (duration: 00m 12s) [10:24:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:24:46] (03PS1) 10Muehlenhoff: Configure nf_conntrack hash table size and install conntrack check via NRPE [puppet] - 10https://gerrit.wikimedia.org/r/238712 [10:25:06] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1644499 (10fgiunchedi) >>! In T108613#1644354, @fgiunchedi wrote: > enabled hyperthreading on 2002 and rebooted, will watch for gc times if that makes a di... [10:27:03] Sp7|Away: Fix is deployed now, working for me [10:27:32] RoanKattouw: yeah its works for me as well, thanks for the quick response :) [10:38:14] 6operations, 10RESTBase, 10RESTBase-Cassandra: cassandra client authentication - https://phabricator.wikimedia.org/T112742#1644518 (10fgiunchedi) 3NEW a:3fgiunchedi [10:44:14] (03PS2) 10Giuseppe Lavagetto: Add instrumentation [debs/pybal] - 10https://gerrit.wikimedia.org/r/238152 (https://phabricator.wikimedia.org/T102394) [10:44:38] <_joe_> godog: I should've addressed your concerns ^^ [10:44:49] <_joe_> also added tests, etc [10:47:06] !log catrope@tin Started scap: (no message) [10:47:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:47:28] !log catrope@tin scap aborted: (no message) (duration: 00m 21s) [10:47:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:48:39] !log reenabling semisync on db1072 and db1073 [10:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:49:40] (03CR) 10Muehlenhoff: "I'm glad you didn't call it looking-for-freedom.sh :-)" [puppet] - 
10https://gerrit.wikimedia.org/r/228137 (owner: 10Dzahn) [10:49:42] (03CR) 10JanZerebecki: [C: 031] Whitelist m.wikidata.org for central auth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238394 (https://phabricator.wikimedia.org/T112087) (owner: 10Bene) [10:49:56] !log catrope@tin Synchronized php-1.26wmf22: (no message) (duration: 02m 12s) [10:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:50:22] 6operations, 10RESTBase, 10RESTBase-Cassandra: cassandra client authentication - https://phabricator.wikimedia.org/T112742#1644547 (10fgiunchedi) [10:50:24] 6operations, 10RESTBase, 10RESTBase-Cassandra: use non-default credentials when authenticating to Cassandra - https://phabricator.wikimedia.org/T92590#1644548 (10fgiunchedi) [10:50:41] 6operations, 10RESTBase, 10RESTBase-Cassandra: use non-default credentials when authenticating to Cassandra - https://phabricator.wikimedia.org/T92590#1115527 (10fgiunchedi) [10:50:44] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1644550 (10fgiunchedi) [10:51:59] _joe_: sweet, I'll take a look shortly [10:52:09] !log catrope@tin Synchronized php-1.26wmf23: (no message) (duration: 02m 04s) [10:52:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [10:55:25] (03CR) 10Aude: [C: 04-1] Whitelist m.wikidata.org for central auth (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238394 (https://phabricator.wikimedia.org/T112087) (owner: 10Bene) [10:58:03] I saw Rpl_semi_sync_master_tx_avg_wait_time = 622 and I was like, WTF? 
Then I saw that the unit is microseconds :-) [11:00:35] (03PS3) 10Bene: Whitelist m.wikidata.org for central auth [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238394 (https://phabricator.wikimedia.org/T112087) [11:02:24] jynus: BTW while I see you here, we have a patch for an artificial PK for those Flow tables https://gerrit.wikimedia.org/r/#/c/238393/ . When would you have time to review & deploy that? [11:02:53] if you are going to add a PK, I will give you priority [11:03:18] (03PS1) 10Muehlenhoff: Enable ferm on mw1019-mw1025 (canary appservers) [puppet] - 10https://gerrit.wikimedia.org/r/238715 [11:03:19] let me check [11:07:40] !log depooled mw1019-mw1025 (to enable ferm) [11:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:08:10] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM, will coordinate with Brion for rollout." [puppet] - 10https://gerrit.wikimedia.org/r/234699 (https://phabricator.wikimedia.org/T110707) (owner: 10Brion VIBBER) [11:12:35] 6operations, 10MediaWiki-extensions-GWToolset, 6Multimedia, 7Performance: Undertake a mass upload of 14 million files (1.5 TB) to Commons - https://phabricator.wikimedia.org/T88758#1644561 (10Harej) The project seems to be delayed indefinitely in general. What's another 50 days? 
:) [11:13:34] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm on mw1019-mw1025 (canary appservers) [puppet] - 10https://gerrit.wikimedia.org/r/238715 (owner: 10Muehlenhoff) [11:13:41] 6operations, 10MediaWiki-extensions-GWToolset, 6Multimedia, 7Performance: Undertake a mass upload of 14 million files (1.5 TB) to Commons - https://phabricator.wikimedia.org/T88758#1644562 (10Reedy) 5Open>3stalled [11:13:53] !log create restbase user on cassandra test cluster [11:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:19:25] (03CR) 10Faidon Liambotis: [C: 04-1] Configure nf_conntrack hash table size and install conntrack check via NRPE (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/238712 (owner: 10Muehlenhoff) [11:21:40] (03Abandoned) 10Faidon Liambotis: Replace user_forward shellout by an Exim LDAP query [puppet] - 10https://gerrit.wikimedia.org/r/164386 (owner: 10Mark Bergsma) [11:24:06] 6operations, 10RESTBase, 10RESTBase-Cassandra: use non-default credentials when authenticating to Cassandra - https://phabricator.wikimedia.org/T92590#1644581 (10fgiunchedi) plan: * create restbase user on labs/staging/production clusters * grant permissions accordingly * switch user/password in puppet for r... [11:24:58] !log making db1069 a sibling of db1055 (s1) [11:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:28:09] (03PS2) 10Muehlenhoff: Configure nf_conntrack hash table size and install conntrack check via NRPE [puppet] - 10https://gerrit.wikimedia.org/r/238712 [11:32:00] (03CR) 10Faidon Liambotis: [C: 031] "LGTM for now, although at some point we should abstract it and don't repeat ourselves in two different places." 
[puppet] - 10https://gerrit.wikimedia.org/r/238712 (owner: 10Muehlenhoff) [11:32:45] !log repooled mw1019-mw1025 with ferm enabled [11:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [11:35:26] 6operations, 10Analytics: Moving analysis data from flourine to analytics cluster - https://phabricator.wikimedia.org/T112744#1644598 (10Addshore) 3NEW [11:35:47] 6operations, 10Analytics: Moving analysis data from flourine to analytics cluster - https://phabricator.wikimedia.org/T112744#1644605 (10Addshore) [11:36:31] 6operations, 10Analytics: Moving analysis data from flourine to analytics cluster - https://phabricator.wikimedia.org/T112744#1644598 (10Addshore) [11:36:45] (03CR) 10Filippo Giunchedi: [C: 031] "nice work!" [debs/pybal] - 10https://gerrit.wikimedia.org/r/238152 (https://phabricator.wikimedia.org/T102394) (owner: 10Giuseppe Lavagetto) [11:37:26] 6operations, 7Mail: Protect incoming emails with SMTP STARTLS - https://phabricator.wikimedia.org/T101452#1644609 (10faidon) [11:39:47] 6operations, 10Wikimedia-Mailing-lists, 7Mail: Enable STARTTLS (both inbound and outbound) on lists - https://phabricator.wikimedia.org/T82576#1644611 (10faidon) [12:11:46] 6operations, 7HHVM, 5Patch-For-Review: Package and deploy HHVM 3.6.5+dfsg1-1+wm3 - https://phabricator.wikimedia.org/T112640#1644646 (10Joe) [12:14:11] 6operations, 10ops-eqiad, 5Patch-For-Review: mw1061 has a faulty disk, filesystem is read-only - https://phabricator.wikimedia.org/T107849#1644649 (10MoritzMuehlenhoff) mw1061 is still de-pooled with a link to this ticket, though? Or is there anything else needed to bring it back? [12:15:55] 6operations, 10ops-eqiad, 5Patch-For-Review: mw1061 has a faulty disk, filesystem is read-only - https://phabricator.wikimedia.org/T107849#1644650 (10Joe) Nope, if it's still depooled it's because I forgot to repool it, I guess. 
[12:23:41] 6operations, 10hardware-requests: Request three servers for Pageview API - https://phabricator.wikimedia.org/T111053#1644653 (10akosiaris) >>! In T111053#1642679, @Ottomata wrote: > @akosiaris we need to know: > > - Is aqs100x ok for a name I got nothing better, so yes. > - What VLAN should I put these in (... [12:33:04] (03PS1) 10Muehlenhoff: Enable ferm for mw1115-mw1119 [puppet] - 10https://gerrit.wikimedia.org/r/238730 [12:42:12] !log depooling mw1115-mw1117, mw1119 (mw1118 was already depooled) to enable ferm [12:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:43:19] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm for mw1115-mw1119 [puppet] - 10https://gerrit.wikimedia.org/r/238730 (owner: 10Muehlenhoff) [12:48:06] !log repooled mw1115-mw1117, mw1119 (with ferm enabled) [12:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [12:50:50] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1644690 (10Eevans) >>! In T108613#1644499, @fgiunchedi wrote: >>>! In T108613#1644354, @fgiunchedi wrote: >> enabled hyperthreading on 2002 and rebooted, w... [12:51:47] (03CR) 10Zfilipin: [C: 04-1] "JanZerebecki: Should this patch be abandoned? As you said, submodules are not cloned:" [puppet] - 10https://gerrit.wikimedia.org/r/238471 (https://phabricator.wikimedia.org/T102020) (owner: 10JanZerebecki) [12:52:44] 6operations, 10RESTBase, 10RESTBase-Cassandra, 5Patch-For-Review: Set up multi-DC replication for Cassandra - https://phabricator.wikimedia.org/T108613#1644692 (10Eevans) >>! In T108613#1644690, @Eevans wrote: >>>! In T108613#1644499, @fgiunchedi wrote: >>>>! In T108613#1644354, @fgiunchedi wrote: >>> enab... 
[12:54:11] 7Puppet, 10Continuous-Integration-Config, 5Patch-For-Review: Setup rubocop for operations/puppet ruby code lints - https://phabricator.wikimedia.org/T102020#1644695 (10zeljkofilipin) Well, we do not even need to explicitly ignore git submodules, since the Jenkins job does not even clone them. :D https://int... [12:58:03] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [500.0] [12:59:22] (03PS1) 10Jcrespo: Depool db1055 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238734 [13:01:37] (03CR) 10Jcrespo: [C: 032] Depool db1055 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238734 (owner: 10Jcrespo) [13:03:26] !log disabling asw-d-eqiad xe-8/0/23, xe-8/0/24, xe-8/0/25, xe-8/0/26, xe-8/0/27, xe-8/0/28; servers reboot-looping -> asw-d's SNMP unhappy -> librenms unhappy -> faidon's mailbox unhappy [13:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:05:19] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [13:10:52] <_joe_> ottomata: interview? [13:10:53] <_joe_> :) [13:10:59] 6operations: sysctl::parameters don't take effect until next reboot (on Trusty at least) - https://phabricator.wikimedia.org/T109711#1644731 (10MoritzMuehlenhoff) 5Open>3Resolved a:3MoritzMuehlenhoff My recent patch to change the conntrack table size worked fine across the fleet, so I'm closing this bug. [13:11:02] i know i just realized! 
[13:15:00] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool es1055 for maintenance (duration: 00m 12s) [13:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:17:09] reverting [13:17:31] (03PS1) 10Jcrespo: Revert "Depool db1055 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238736 [13:18:18] (03CR) 10Jcrespo: [C: 032] Revert "Depool db1055 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238736 (owner: 10Jcrespo) [13:19:30] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Reverting depool of es1055 (duration: 00m 12s) [13:19:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:23:43] (03PS1) 10Muehlenhoff: enable ferm on mw1149-mw1151 [puppet] - 10https://gerrit.wikimedia.org/r/238737 [13:24:25] !log depooled mw1149-mw1151 (for enabling ferm) [13:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:25:25] (03PS1) 10Eevans: configure datacenter set [puppet] - 10https://gerrit.wikimedia.org/r/238738 (https://phabricator.wikimedia.org/T76494) [13:26:17] (03CR) 10Muehlenhoff: [C: 032 V: 032] enable ferm on mw1149-mw1151 [puppet] - 10https://gerrit.wikimedia.org/r/238737 (owner: 10Muehlenhoff) [13:33:47] 6operations, 6Performance-Team, 6Scrum-of-Scrums: Define SLAs for media - https://phabricator.wikimedia.org/T112692#1644772 (10Dzahn) [13:35:11] !log repooled mw1149-mw1151 (with ferm enabled) [13:35:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:36:44] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: puppet fail [13:37:35] (03PS4) 10Hoo man: Add m.wikidata.org to wgCrossSiteAJAXdomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238394 (https://phabricator.wikimedia.org/T112087) (owner: 10Bene) [13:40:05] !log catrope@tin Synchronized php-1.26wmf23: (no message) (duration: 01m 37s) [13:40:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:40:31] !log initiating Cassandra repair on restbase1007 (nodetool repair -pr) [13:40:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [13:41:16] (03PS4) 10Andrew Bogott: Nova: remove_unused_base_images=True [puppet] - 10https://gerrit.wikimedia.org/r/238515 [13:42:27] (03CR) 10Andrew Bogott: [C: 032] Nova: remove_unused_base_images=True [puppet] - 10https://gerrit.wikimedia.org/r/238515 (owner: 10Andrew Bogott) [13:49:36] 6operations, 6Performance-Team: Define SLAs for media - https://phabricator.wikimedia.org/T112692#1644791 (10faidon) [13:50:05] mutante: ?? [13:54:57] (03PS1) 10Jcrespo: Depool db1051 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238742 [13:55:48] (03CR) 10Jcrespo: [C: 032] Depool db1051 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238742 (owner: 10Jcrespo) [13:57:21] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool es1051 for maintenance (duration: 00m 12s) [13:57:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:01:49] 6operations, 10hardware-requests: eqiad: (1) hardware request for ElasticSearch replication to Labs - 4 weeks use - https://phabricator.wikimedia.org/T112163#1644799 (10mark) >>! In T112163#1642184, @EBernhardson wrote: > there is also the possibility of using the old lsearchd cluster (but they are 1.5yrs out... 
[14:03:38] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [14:09:30] !log upgrading and restarting db1051 [14:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:13:45] paravoid: what's up [14:14:34] oh, i thought SoS because it needs talk between the 2 teams [14:14:39] nevermind then [14:18:03] !log disabling/ignoring asw-d-eqiad @ librenms [14:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:18:10] (03CR) 10Hashar: [C: 031] "Yah I mixed Verified and Code-Review labels :( That is ready to go." [puppet] - 10https://gerrit.wikimedia.org/r/238491 (owner: 10Hashar) [14:27:23] !log asw-d-eqiad: toggling RE mastership [14:27:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:28:29] (03PS1) 10Muehlenhoff: Enable ferm for initial jobrunner in eqiad (mw1010) [puppet] - 10https://gerrit.wikimedia.org/r/238748 [14:32:35] (03CR) 10Muehlenhoff: [C: 032 V: 032] Enable ferm for initial jobrunner in eqiad (mw1010) [puppet] - 10https://gerrit.wikimedia.org/r/238748 (owner: 10Muehlenhoff) [14:35:22] 6operations, 10Salt: various salt-minions are not replying to test.ping or commands - https://phabricator.wikimedia.org/T102808#1644879 (10ArielGlenn) 5Open>3Resolved checked with several runs with delay between them, the only two hosts which fail to respond are down or ssh in fails, e.g.: root@palladium... 
[14:38:02] (03PS1) 10BBlack: lvs1007-12 DNS entries, all vlans [dns] - 10https://gerrit.wikimedia.org/r/238750 (https://phabricator.wikimedia.org/T104458) [14:42:48] PROBLEM - puppet last run on mw2006 is CRITICAL: CRITICAL: puppet fail [14:44:11] 6operations, 10hardware-requests: eqiad: (1) hardware request for ElasticSearch replication to Labs - 4 weeks use - https://phabricator.wikimedia.org/T112163#1644906 (10Cmjohnson) @yuvipanda 12 servers in row B is not an option due to lack of space and need for labs. [14:44:32] (03CR) 10MarcoAurelio: [C: 04-1] "I think it's better to wait until T112751 is resolved." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236231 (https://phabricator.wikimedia.org/T110619) (owner: 10MarcoAurelio) [14:45:18] (03CR) 10MarcoAurelio: [C: 04-1] "I think it's better to wait until T112751 can be addressed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/236418 (https://phabricator.wikimedia.org/T111630) (owner: 10MarcoAurelio) [14:45:44] !log enabled ferm on mw1010 (jobrunner) in eqiad [14:45:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:51:48] <_joe_> !log experimenting on testwiki for poolcounter failure scenarios [14:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [14:51:55] 6operations, 10ops-eqiad, 10netops: cr2-eqiad PEM 2 failure - https://phabricator.wikimedia.org/T112000#1644934 (10Cmjohnson) [[ URL | name ]] attached shipping label [14:54:07] RECOVERY - OTRS SMTP on mendelevium is OK: SMTP OK - 0.017 sec. 
response time [14:54:28] RECOVERY - configured eth on mendelevium is OK: OK - interfaces up [14:54:28] RECOVERY - SSH on mendelevium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [14:54:47] RECOVERY - RAID on mendelevium is OK: OK: no RAID installed [14:54:48] RECOVERY - dhclient process on mendelevium is OK: PROCS OK: 0 processes with command name dhclient [14:54:57] RECOVERY - DPKG on mendelevium is OK: All packages OK [14:55:08] RECOVERY - Check size of conntrack table on mendelevium is OK: OK: nf_conntrack is 0 % full [14:55:17] RECOVERY - salt-minion processes on mendelevium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [14:55:17] RECOVERY - Disk space on mendelevium is OK: DISK OK [14:55:22] ACKNOWLEDGEMENT - Disk space on labstore1002 is CRITICAL: DISK CRITICAL - /run/lock/storage-replicate-labstore-maps/snapshot is not accessible: Permission denied Coren Known issue with the check debugging this is on todays radar. [14:56:35] 6operations, 10ops-eqiad, 6Labs, 3Labs-Sprint-114, 3ToolLabs-Goals-Q4: Make certain ports and cables between the labstores and shelves are numbered/named and labeled, and make sure that the diagram(s) reflect that. - https://phabricator.wikimedia.org/T112549#1644947 (10coren) 5Open>3Resolved Diagram... [14:57:10] (03PS3) 10Rush: elasticsearch: apply elasticsearch::server role to codfw [puppet] - 10https://gerrit.wikimedia.org/r/238616 [15:00:04] anomie ostriches marktraceur Krenair: Respected human, time to deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150916T1500). Please do the needful. [15:00:05] hoo: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process. [15:01:00] hoo: Looks like it's just you this morning [15:01:13] ostriches: Ok [15:01:29] I can do it myself, or you can... whatever you prefer [15:01:57] I screwed up my git! Wherps. 
[15:03:25] (03PS1) 10BBlack: lvs1007-12 basic puppetization [puppet] - 10https://gerrit.wikimedia.org/r/238766 (https://phabricator.wikimedia.org/T104458) [15:03:40] (03CR) 10Chad: [C: 032] Add m.wikidata.org to wgCrossSiteAJAXdomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238394 (https://phabricator.wikimedia.org/T112087) (owner: 10Bene) [15:04:12] (03Merged) 10jenkins-bot: Add m.wikidata.org to wgCrossSiteAJAXdomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238394 (https://phabricator.wikimedia.org/T112087) (owner: 10Bene) [15:04:19] (03CR) 10jenkins-bot: [V: 04-1] lvs1007-12 basic puppetization [puppet] - 10https://gerrit.wikimedia.org/r/238766 (https://phabricator.wikimedia.org/T104458) (owner: 10BBlack) [15:04:27] (03PS4) 10Rush: elasticsearch: apply elasticsearch::server role to codfw [puppet] - 10https://gerrit.wikimedia.org/r/238616 [15:05:08] !log demon@tin Synchronized wmf-config/CommonSettings.php: Add m.wikidata.org to wgCrossSiteAJAXdomains (duration: 00m 12s) [15:05:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:05:19] hoo: ^^ [15:05:23] (03PS2) 10BBlack: lvs1007-12 basic puppetization [puppet] - 10https://gerrit.wikimedia.org/r/238766 (https://phabricator.wikimedia.org/T104458) [15:05:30] ostriches: Thanks \o/ [15:05:41] yw [15:06:39] (03PS1) 10Hashar: nodepool: stop logging apscheduler [puppet] - 10https://gerrit.wikimedia.org/r/238768 [15:09:08] ostriches: Verified [15:09:48] (03CR) 10Giuseppe Lavagetto: [C: 032] Use "reload" instead of "force-reload" from logrotate [debs/pybal] - 10https://gerrit.wikimedia.org/r/237986 (https://phabricator.wikimedia.org/T112457) (owner: 10Faidon Liambotis) [15:09:51] (03CR) 10Rush: [C: 032] elasticsearch: apply elasticsearch::server role to codfw [puppet] - 10https://gerrit.wikimedia.org/r/238616 (owner: 10Rush) [15:10:02] 6operations, 6Discovery, 5codfw-rollout: Set up a CirrusSearch cluster in codfw (Dallas, Texas) - 
https://phabricator.wikimedia.org/T105703#1644997 (10chasemp) [15:10:48] !log Started using Nodepool spawned instances. Moved integration-jjb-config-diff Jenkins job to Nodepool with https://gerrit.wikimedia.org/r/#/c/238752/ . See also: https://phabricator.wikimedia.org/T112750 [15:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:11:41] RECOVERY - puppet last run on mw2006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [15:13:30] RECOVERY - NTP on mendelevium is OK: NTP OK: Offset -0.001968860626 secs [15:13:54] 6operations, 5Patch-For-Review: Change distribution in releases.wikimedia.org to "sid" or "jessie" - https://phabricator.wikimedia.org/T111225#1645023 (10cscott) @gwicke: can I point Parsoid users to releases.wikimedia.org now? [15:18:23] (03CR) 10BBlack: [C: 032] lvs1007-12 DNS entries, all vlans [dns] - 10https://gerrit.wikimedia.org/r/238750 (https://phabricator.wikimedia.org/T104458) (owner: 10BBlack) [15:19:55] (03PS3) 10BBlack: lvs1007-12 basic puppetization [puppet] - 10https://gerrit.wikimedia.org/r/238766 (https://phabricator.wikimedia.org/T104458) [15:20:09] (03CR) 10BBlack: [C: 032 V: 032] lvs1007-12 basic puppetization [puppet] - 10https://gerrit.wikimedia.org/r/238766 (https://phabricator.wikimedia.org/T104458) (owner: 10BBlack) [15:20:28] (03PS2) 10BBlack: dhcp for lvs1012 [puppet] - 10https://gerrit.wikimedia.org/r/238528 (https://phabricator.wikimedia.org/T104458) [15:20:34] (03CR) 10BBlack: [C: 032 V: 032] dhcp for lvs1012 [puppet] - 10https://gerrit.wikimedia.org/r/238528 (https://phabricator.wikimedia.org/T104458) (owner: 10BBlack) [15:23:05] (03PS1) 10Giuseppe Lavagetto: pybal (1.10) jessie-wikimedia; urgency=medium [debs/pybal] - 10https://gerrit.wikimedia.org/r/238770 [15:23:32] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] pybal (1.10) jessie-wikimedia; urgency=medium [debs/pybal] - 10https://gerrit.wikimedia.org/r/238770 (owner: 10Giuseppe Lavagetto) 
[15:24:25] !log uploaded debdeploy 0.0.6 to apt.wikimedia.org [15:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:25:49] (03CR) 10Zfilipin: contint: remove obsolete ruby related packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/238436 (owner: 10Hashar) [15:30:20] PROBLEM - ElasticSearch health check for shards on elastic2004 is CRITICAL: CRITICAL - elasticsearch http://10.192.0.133:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.192.0.133, port=9200): Max retries exceeded with url: /_cluster/health (Caused by class socket.error: [Errno 111] Connection refused) [15:30:20] PROBLEM - ElasticSearch health check for shards on elastic2017 is CRITICAL: CRITICAL - elasticsearch http://10.192.32.122:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.192.32.122, port=9200): Max retries exceeded with url: /_cluster/health (Caused by class socket.error: [Errno 111] Connection refused) [15:30:26] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/238431 (https://phabricator.wikimedia.org/T112644) (owner: 10Filippo Giunchedi) [15:30:52] PROBLEM - ElasticSearch health check for shards on elastic2020 is CRITICAL: CRITICAL - elasticsearch http://10.192.48.32:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.192.48.32, port=9200): Max retries exceeded with url: /_cluster/health (Caused by class socket.error: [Errno 111] Connection refused) [15:31:21] (03PS4) 10Zfilipin: WIP Move Ruby related packages to a separate file [puppet] - 10https://gerrit.wikimedia.org/r/237876 (https://phabricator.wikimedia.org/T110865) [15:31:49] PROBLEM - ElasticSearch health check for shards on elastic2001 is CRITICAL: CRITICAL - elasticsearch http://10.192.0.130:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.192.0.130, port=9200): Max retries exceeded with url: /_cluster/health (Caused by class socket.error: [Errno 111] Connection 
refused) [15:31:49] PROBLEM - ElasticSearch health check for shards on elastic2019 is CRITICAL: CRITICAL - elasticsearch http://10.192.48.31:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.192.48.31, port=9200): Max retries exceeded with url: /_cluster/health (Caused by class socket.error: [Errno 111] Connection refused) [15:31:51] (03CR) 10Eevans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/238432 (https://phabricator.wikimedia.org/T112644) (owner: 10Filippo Giunchedi) [15:31:55] PROBLEM - ElasticSearch health check for shards on elastic2002 is CRITICAL: CRITICAL - elasticsearch http://10.192.0.131:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.192.0.131, port=9200): Max retries exceeded with url: /_cluster/health (Caused by class socket.error: [Errno 111] Connection refused) [15:32:15] PROBLEM - ElasticSearch health check for shards on elastic2008 is CRITICAL: CRITICAL - elasticsearch http://10.192.16.144:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.192.16.144, port=9200): Max retries exceeded with url: /_cluster/health (Caused by class socket.error: [Errno 111] Connection refused) [15:32:25] PROBLEM - ElasticSearch health check for shards on elastic2021 is CRITICAL: CRITICAL - elasticsearch http://10.192.48.33:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.192.48.33, port=9200): Max retries exceeded with url: /_cluster/health (Caused by class socket.error: [Errno 111] Connection refused) [15:32:36] PROBLEM - ElasticSearch health check for shards on elastic2013 is CRITICAL: CRITICAL - elasticsearch http://10.192.32.118:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.192.32.118, port=9200): Max retries exceeded with url: /_cluster/health (Caused by class socket.error: [Errno 111] Connection refused) [15:32:45] PROBLEM - ElasticSearch health check for shards on elastic2022 is CRITICAL: CRITICAL - elasticsearch 
http://10.192.48.34:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.192.48.34, port=9200): Max retries exceeded with url: /_cluster/health (Caused by class socket.error: [Errno 111] Connection refused) [15:32:55] PROBLEM - ElasticSearch health check for shards on elastic2005 is CRITICAL: CRITICAL - elasticsearch http://10.192.0.134:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.192.0.134, port=9200): Max retries exceeded with url: /_cluster/health (Caused by class socket.error: [Errno 111] Connection refused) [15:34:02] uh [15:34:06] hello? [15:34:23] looks like codfw [15:34:29] but please ack them, not silence them [15:34:32] ahh, yeah, didn't look [15:34:34] (chasemp: I assume) [15:34:48] yes sorry it got out before I could silence it [15:34:56] PROBLEM - ElasticSearch health check for shards on elastic2006 is CRITICAL: CRITICAL - elasticsearch http://10.192.0.135:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.192.0.135, port=9200): Max retries exceeded with url: /_cluster/health (Caused by class socket.error: [Errno 111] Connection refused) [15:34:56] PROBLEM - ElasticSearch health check for shards on elastic2010 is CRITICAL: CRITICAL - elasticsearch http://10.192.16.146:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.192.16.146, port=9200): Max retries exceeded with url: /_cluster/health (Caused by class socket.error: [Errno 111] Connection refused) [15:35:00] ack them, don't silence them [15:35:08] silenced checks still appear under https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?hosts=all&style=hostservicedetail&hoststatustypes=12&hostprops=2097162&servicestatustypes=28&serviceprops=2097162&nostatusheader [15:35:18] well it's going to flap with setup for awhile [15:35:25] I don't want it to raalert until I'm ready [15:35:30] realert that is [15:35:44] or, why would I want this alerting at all now? 
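Editorial sketch of the ack-versus-downtime distinction discussed above, assuming the classic Nagios/Icinga 1.x external command format. The author name and comment text are hypothetical; the host and service description are taken from the alerts in this log. The sketch only builds the command strings. Writing them to the Icinga command file is deliberately left out, since that path is not given here.

```python
import time


def ack_service_problem(host, service, author, comment, now=None):
    """Build an ACKNOWLEDGE_SVC_PROBLEM external command line.

    Field order: host;service;sticky;notify;persistent;author;comment.
    sticky=2 keeps the acknowledgement until the service recovers,
    notify=0 sends no notification about the ack, persistent=1 keeps
    the comment across restarts. These values are illustrative choices.
    """
    now = int(time.time()) if now is None else int(now)
    return "[{0}] ACKNOWLEDGE_SVC_PROBLEM;{1};{2};2;0;1;{3};{4}".format(
        now, host, service, author, comment
    )


def schedule_service_downtime(host, service, author, comment, duration_s, now=None):
    """Build a SCHEDULE_SVC_DOWNTIME external command line.

    Field order: host;service;start;end;fixed;trigger_id;duration;author;comment.
    fixed=1 means the downtime runs exactly from start to end;
    trigger_id=0 means it is not triggered by another downtime.
    """
    now = int(time.time()) if now is None else int(now)
    end = now + int(duration_s)
    return "[{0}] SCHEDULE_SVC_DOWNTIME;{1};{2};{3};{4};1;0;{5};{6};{7}".format(
        now, host, service, now, end, int(duration_s), author, comment
    )


if __name__ == "__main__":
    svc = "ElasticSearch health check for shards"
    print(ack_service_problem("elastic2004", svc, "example-user", "codfw setup in progress"))
    print(schedule_service_downtime("elastic2004", svc, "example-user", "codfw setup", 4 * 3600))
```

An acknowledgement marks a known problem as handled, while a scheduled downtime suppresses notifications for a planned window, which is closer to what a still-flapping new cluster needs.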
[15:36:16] PROBLEM - ElasticSearch health check for shards on elastic2024 is CRITICAL: CRITICAL - elasticsearch http://10.192.48.36:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.192.48.36, port=9200): Max retries exceeded with url: /_cluster/health (Caused by class socket.error: [Errno 111] Connection refused) [15:36:20] (03PS1) 10Chad: Remove push.default = simple from .gitconfig for now [puppet] - 10https://gerrit.wikimedia.org/r/238773 [15:36:22] (03CR) 10Zfilipin: WIP Move Ruby related packages to a separate file (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/237876 (https://phabricator.wikimedia.org/T110865) (owner: 10Zfilipin) [15:36:47] (03PS2) 10Chad: Remove push.default = simple from .gitconfig for now [puppet] - 10https://gerrit.wikimedia.org/r/238773 [15:36:56] PROBLEM - ElasticSearch health check for shards on elastic2007 is CRITICAL: CRITICAL - elasticsearch http://10.192.16.143:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.192.16.143, port=9200): Max retries exceeded with url: /_cluster/health (Caused by class socket.error: [Errno 111] Connection refused) [15:37:03] Someone got a second on ^? I kinda fubar'd my own git [15:39:06] ostriches: wassup? [15:39:21] oh, you need opsen [15:39:42] Yeah, puppetz. [15:40:17] (03CR) 10Zfilipin: WIP Move Ruby related packages to a separate file (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/237876 (https://phabricator.wikimedia.org/T110865) (owner: 10Zfilipin) [15:40:18] 6operations, 10RESTBase: enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#1645244 (10Eevans) >>! In T112648#1643318, @GWicke wrote: > We have had this discussion a few times now. The basic issue of local logging being blocking (and thus running the risk of taking out a service wh... 
[15:42:30] chasemp: schedule a downtime then [15:42:43] I'm working on clearing this out yeah [15:42:54] thanks [15:42:57] (03CR) 10JanZerebecki: "Looks good, do not deploy yet." [puppet] - 10https://gerrit.wikimedia.org/r/238396 (https://phabricator.wikimedia.org/T111015) (owner: 10Bene) [15:44:00] <_joe_> !log uploading pybal 1.10 to reprepro, installing to the test cluster [15:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [15:46:34] 6operations, 10RESTBase: enable restbase syslog/file logging - https://phabricator.wikimedia.org/T112648#1645274 (10GWicke) It is solvable, for example by logging to syslog using udp with https://github.com/mcavage/node-bunyan-syslog. The vast majority of write-to-file methods however aren't safe against disks... [15:53:18] (03CR) 10Mobrovac: [C: 031] configure datacenter set [puppet] - 10https://gerrit.wikimedia.org/r/238738 (https://phabricator.wikimedia.org/T76494) (owner: 10Eevans) [15:55:01] ottomata: analytics1015 puppet dpkg broken packages [15:55:10] ottomata: and 1027 puppet disabled [15:55:48] broken on an15? hm. 1027 disabled ok. been a busy morn. [15:56:06] paravoid: apt-get is not complaining [15:56:12] how do I see broken packages? [15:56:23] dpkg -l |grep -v ^ii [15:56:29] so there are about a dozen hosts with unaccepted salt keys on palladium, should I just make a ticket for this and start pestering ppl I guess? [15:56:38] ottomata: "rc" are fine too [15:56:55] dpkg -s mysql-server-5.5 [15:56:58] Status: purge ok config-files [15:57:28] chasemp: arguably, we could put in an icinga check on the salt-master for that (unaccepted keys older than X) [15:57:47] maybe make X a few hours to paper over common installation patterns and whatnot [15:58:39] ah, paravoid, this is because /var/lib/mysql is a mount. [15:58:41] gah, hm. 
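Editorial sketch of the `dpkg -l |grep -v ^ii` check suggested above, extended to also skip the "rc" state that was called fine. It parses text in the usual `dpkg -l` table layout. The sample output is hypothetical apart from the package name `mysql-server-5.5`, whose "purge ok config-files" status is assumed here to show as `pc` in the first column.

```python
def non_clean_packages(dpkg_l_output):
    """Return (state, package) pairs for rows that are neither 'ii' nor 'rc'.

    'ii' is installed and configured; 'rc' is removed with config files
    left behind. Anything else is reported for a human to look at.
    """
    flagged = []
    in_table = False
    for line in dpkg_l_output.splitlines():
        # The package table starts after the '+++-====' separator row.
        if line.startswith("+++-"):
            in_table = True
            continue
        if not in_table or not line.strip():
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        state, name = fields[0], fields[1]
        if state not in ("ii", "rc"):
            flagged.append((state, name))
    return flagged


# Hypothetical sample in the layout printed by `dpkg -l`.
SAMPLE = """\
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name              Version  Architecture Description
+++-=================-========-============-===========================
ii  bash              4.3      amd64        GNU Bourne Again SHell
rc  oldpkg            1.0      amd64        removed, config files remain
pc  mysql-server-5.5  5.5.44   amd64        MySQL database server
"""

if __name__ == "__main__":
    for state, name in non_clean_packages(SAMPLE):
        print(state, name)
```

In practice the input would come from running `dpkg -l` on the host; the function itself only needs the text.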
[15:59:27] bblack: yeah seems sane, for now though not that I sense shenanigans but I don't want to accept these myself without context [15:59:51] 7Blocked-on-Operations, 6operations, 6Phabricator, 10Traffic: Pharicator needs to expose notification daemon (websocket) - https://phabricator.wikimedia.org/T112765#1645318 (10Dzahn) 3NEW [16:01:28] 7Blocked-on-Operations, 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh and notification daemon (websocket) - https://phabricator.wikimedia.org/T100519#1314620 (10Dzahn) >>! In T100519#1630955, @mmodell wrote: > Can anyone comment on how websockets fit into the current plan? Does that... [16:01:39] RECOVERY - DPKG on analytics1015 is OK: All packages OK [16:03:47] chasemp: yeah whoever actually installed the box or whatever should accept them [16:03:51] 7Blocked-on-Operations, 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1645347 (10Dzahn) [16:03:56] but we could make it gripe at us automatically [16:04:18] root@palladium:~# find /etc/salt/pki/master/minions_pre -type f -a -mtime +1 [16:04:28] ^ finds unaccepted keys sitting around for 1-2 days+ [16:04:38] PROBLEM - ElasticSearch health check for shards on elastic2012 is CRITICAL: CRITICAL - elasticsearch http://10.192.16.148:9200/_cluster/health error while fetching: HTTPConnectionPool(host=10.192.16.148, port=9200): Read timed out. 
(read timeout=4) [16:05:15] icinga is so slow I'm having trouble scheduling downtime fast enough here folks sorry [16:05:47] (03PS2) 10Andrew Bogott: nodepool: bump # of instances [puppet] - 10https://gerrit.wikimedia.org/r/238491 (owner: 10Hashar) [16:05:51] !log upgrading pybal on lvs200[456] [16:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:06:53] (03CR) 10Andrew Bogott: [C: 032] nodepool: bump # of instances [puppet] - 10https://gerrit.wikimedia.org/r/238491 (owner: 10Hashar) [16:07:01] mutante: Can you poke https://gerrit.wikimedia.org/r/#/c/238773/? It's a followup to Monday...I screwed up my own git [16:07:53] ostriches: oh. sure [16:08:09] (03PS3) 10Dzahn: Remove push.default = simple from .gitconfig for now [puppet] - 10https://gerrit.wikimedia.org/r/238773 (owner: 10Chad) [16:08:49] !log upgrading pybal on lvs200[123] [16:08:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:09:59] mutante: Ty sir [16:10:00] (03CR) 10Dzahn: [C: 032] Remove push.default = simple from .gitconfig for now [puppet] - 10https://gerrit.wikimedia.org/r/238773 (owner: 10Chad) [16:10:33] ostriches: np! [16:10:35] 7Blocked-on-Operations, 6operations, 6Phabricator, 10Traffic: Phabricator needs to expose ssh - https://phabricator.wikimedia.org/T100519#1645361 (10mmodell) @dzahn, thanks! 
[16:11:05] (03PS2) 10Andrew Bogott: nodepool: stop logging apscheduler [puppet] - 10https://gerrit.wikimedia.org/r/238768 (owner: 10Hashar) [16:11:24] (03PS1) 10Chad: Allow top-level logger to track lower level git operations [tools/scap] - 10https://gerrit.wikimedia.org/r/238777 [16:11:52] (03CR) 10jenkins-bot: [V: 04-1] Allow top-level logger to track lower level git operations [tools/scap] - 10https://gerrit.wikimedia.org/r/238777 (owner: 10Chad) [16:12:12] (03CR) 10Andrew Bogott: [C: 032] nodepool: stop logging apscheduler [puppet] - 10https://gerrit.wikimedia.org/r/238768 (owner: 10Hashar) [16:12:46] !log upgrading pybal on lvs400[34], lvs300[34] [16:12:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:14:15] (03PS1) 10Zfilipin: rubocop: Fixed Style/WordArray offense [puppet] - 10https://gerrit.wikimedia.org/r/238778 (https://phabricator.wikimedia.org/T112651) [16:14:17] (03PS2) 10Chad: Allow top-level logger to track lower level git operations [tools/scap] - 10https://gerrit.wikimedia.org/r/238777 [16:14:26] 6operations: mystery palladium unaccepted salt keys - https://phabricator.wikimedia.org/T112767#1645371 (10chasemp) 3NEW [16:15:19] i like how salt-key has an option called --hard-crash [16:16:07] !log upgrading pybal on lvs400[12] [16:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:17:29] 6operations: mystery palladium unaccepted salt keys - https://phabricator.wikimedia.org/T112767#1645384 (10Dzahn) Key for minion fermium.eqiad.wmnet deleted. that was because we first installed it with a private IP and now it's in .wikimedia.org. removed [16:17:29] mutante: I didn't see how to make -L give me a timestamp for the key request [16:17:36] idea on how to do that? 
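Editorial sketch of the "unaccepted keys older than X" idea from this discussion, assuming the file modification time of each pending key is a usable stand-in for when the key request arrived (the log does not confirm that salt-key exposes such a timestamp). The directory `/etc/salt/pki/master/minions_pre` is the one used in the `find` command above; the function name and threshold are hypothetical. Unlike `find -mtime +1`, which rounds ages down to whole days, this compares against an exact age in hours.

```python
import os
import time


def stale_pending_keys(pending_dir, max_age_hours, now=None):
    """List (minion_id, mtime) for pending key files older than max_age_hours.

    Results are sorted oldest first. 'now' can be injected for testing.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_hours * 3600
    stale = []
    for entry in os.scandir(pending_dir):
        if not entry.is_file():
            continue
        mtime = entry.stat().st_mtime
        if mtime < cutoff:
            stale.append((entry.name, mtime))
    return sorted(stale, key=lambda item: item[1])


if __name__ == "__main__":
    pending = "/etc/salt/pki/master/minions_pre"
    if os.path.isdir(pending):
        for minion_id, mtime in stale_pending_keys(pending, max_age_hours=6):
            stamp = time.strftime("%Y-%m-%d %H:%M", time.localtime(mtime))
            print("{0}  pending since about {1}".format(minion_id, stamp))
```

A monitoring check could wrap this and go critical when the returned list is non-empty, with the threshold set to a few hours so that ordinary installs do not trip it.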
[16:18:20] not yet, but cleaned up one of them up, and deleting another one [16:18:28] analtyics = typo [16:18:47] 6operations, 5Patch-For-Review, 7Pybal: jessie pybals get restarted every day by logrotate, resetting BGP sessions - https://phabricator.wikimedia.org/T112457#1645387 (10BBlack) a:3Joe [16:19:09] (03PS3) 10Chad: Allow top-level logger to track lower level git operations [tools/scap] - 10https://gerrit.wikimedia.org/r/238777 [16:19:13] 6operations, 10Traffic, 5Patch-For-Review: Upgrade codfw,ulsfo,esams LVS to jessie - https://phabricator.wikimedia.org/T96375#1645391 (10BBlack) [16:19:14] 6operations, 5Patch-For-Review, 7Pybal: jessie pybals get restarted every day by logrotate, resetting BGP sessions - https://phabricator.wikimedia.org/T112457#1645389 (10BBlack) 5Open>3Resolved package upgraded w/ faidon's fix on all jessie LVS now. [16:19:39] (03PS1) 10Zfilipin: rubocop: Fixed Style/TrailingComma offense [puppet] - 10https://gerrit.wikimedia.org/r/238779 (https://phabricator.wikimedia.org/T112651) [16:19:52] (03CR) 10Zfilipin: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/238778 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [16:20:14] 6operations: mystery palladium unaccepted salt keys - https://phabricator.wikimedia.org/T112767#1645393 (10Dzahn) Key for minion analtyics1046.eqiad.wmnet deleted. 
<-- typo, removed [16:20:18] (03CR) 10Zfilipin: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/238779 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [16:21:00] (03PS2) 10Giuseppe Lavagetto: Use backported ffmpeg for multimedia transcoding on Trusty [puppet] - 10https://gerrit.wikimedia.org/r/234699 (https://phabricator.wikimedia.org/T110707) (owner: 10Brion VIBBER) [16:22:00] (03PS4) 10Chad: Allow top-level logger to track lower level git operations [tools/scap] - 10https://gerrit.wikimedia.org/r/238777 [16:22:03] 6operations: mystery palladium unaccepted salt keys - https://phabricator.wikimedia.org/T112767#1645396 (10Andrew) I fixed labsdb1004.eqiad.wmnet [16:22:13] (03CR) 10jenkins-bot: [V: 04-1] Allow top-level logger to track lower level git operations [tools/scap] - 10https://gerrit.wikimedia.org/r/238777 (owner: 10Chad) [16:23:02] (03CR) 10Zfilipin: "I am not sure that this RuboCop rule is good for this repository. I think Puppet convention is to leave trailing commas. 
Should I change t" [puppet] - 10https://gerrit.wikimedia.org/r/238779 (https://phabricator.wikimedia.org/T112651) (owner: 10Zfilipin) [16:23:17] (03PS5) 10Chad: Allow top-level logger to track lower level git operations [tools/scap] - 10https://gerrit.wikimedia.org/r/238777 [16:25:36] (03PS1) 10Brion VIBBER: Use ffmpeg for video thumbs & transcoding [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238781 (https://phabricator.wikimedia.org/T53313) [16:25:53] 7Puppet, 5Patch-For-Review: Fix easy problems reported by RuboCop in operations/puppet - https://phabricator.wikimedia.org/T112651#1645407 (10zeljkofilipin) [16:26:19] 6operations, 10fundraising-tech-ops: reformulate kafkatee package to work with Trusty - https://phabricator.wikimedia.org/T110591#1645409 (10Jgreen) working fine [16:26:42] 6operations, 10fundraising-tech-ops: reformulate kafkatee package to work with Trusty - https://phabricator.wikimedia.org/T110591#1645412 (10Jgreen) 5Open>3Resolved [16:27:32] (03PS6) 10Chad: Allow top-level logger to track lower level git operations [tools/scap] - 10https://gerrit.wikimedia.org/r/238777 [16:27:35] (03PS1) 10Chad: Use HHVM's logger in its sudo_check_call [tools/scap] - 10https://gerrit.wikimedia.org/r/238782 [16:27:42] (03CR) 10Giuseppe Lavagetto: [C: 032] Use backported ffmpeg for multimedia transcoding on Trusty [puppet] - 10https://gerrit.wikimedia.org/r/234699 (https://phabricator.wikimedia.org/T110707) (owner: 10Brion VIBBER) [16:27:46] (03CR) 10Andrew Bogott: [C: 031] "Probably best to merge this when I'm around to watch :)" [puppet] - 10https://gerrit.wikimedia.org/r/238712 (owner: 10Muehlenhoff) [16:28:02] \o/ [16:31:05] (03CR) 10Brion VIBBER: [C: 032] "joe said I could merge this :D We're watching the servers to make sure nothing explodes." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/238781 (https://phabricator.wikimedia.org/T53313) (owner: 10Brion VIBBER) [16:31:18] (03CR) 1020after4: [C: 032] Execute distinct stages of deployment separately [tools/scap] - 10https://gerrit.wikimedia.org/r/238631 (https://phabricator.wikimedia.org/T109861) (owner: 10Dduvall) [16:33:25] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, 5Patch-For-Review: Support VP9 in TMH (Unable to decode) - https://phabricator.wikimedia.org/T55863#1645447 (10Paladox) Since some patches have been merged how can I find out where a vp9 video is to test. [16:33:44] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, 5Patch-For-Review: Support VP9 in TMH (Unable to decode) - https://phabricator.wikimedia.org/T55863#1645448 (10Paladox) Doesent timedmediahandler need to enable it since it is disabled by default. [16:35:11] (03PS1) 10Ottomata: Run 1 EventLogging processor until bug is fixed [puppet] - 10https://gerrit.wikimedia.org/r/238784 (https://phabricator.wikimedia.org/T112688) [16:35:15] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, 5Patch-For-Review: Support VP9 in TMH (Unable to decode) - https://phabricator.wikimedia.org/T55863#1645458 (10brion) No, nothing needs to be enabled on the MediaWiki/TMH end to *decode* VP9 videos -- it simply needs to be present in the a... [16:36:06] (03CR) 10Ottomata: [C: 032] Run 1 EventLogging processor until bug is fixed [puppet] - 10https://gerrit.wikimedia.org/r/238784 (https://phabricator.wikimedia.org/T112688) (owner: 10Ottomata) [16:36:17] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, 5Patch-For-Review: Support VP9 in TMH (Unable to decode) - https://phabricator.wikimedia.org/T55863#1645470 (10Paladox) Oh ok but to watch videos. 
you have to set $wgEnabledTranscodeSet [16:38:50] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, 5Patch-For-Review: Support VP9 in TMH (Unable to decode) - https://phabricator.wikimedia.org/T55863#1645480 (10Paladox) It shows this error now '/usr/bin/avconv' -y -i '/tmp/localcopy_bffa71b3af37-1.webm' -threads 2 -skip_threshold 0 -bu... [16:41:42] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, 5Patch-For-Review: Support VP9 in TMH (Unable to decode) - https://phabricator.wikimedia.org/T55863#1645507 (10Joe) @Paladox where are you testing this? The production thm servers are not updated, you might test it in beta in an hour or so. [16:43:24] (03CR) 10Hashar: "Andrew raised the quotas for the contintcloud project to match 20 instances." [puppet] - 10https://gerrit.wikimedia.org/r/238491 (owner: 10Hashar) [16:44:00] 6operations: Change Google Webmaster password for noc@ - https://phabricator.wikimedia.org/T110951#1645519 (10Jalexander) p:5Low>3High >>! In T110951#1613933, @jcrespo wrote: > Hello, @Jalexander, > > This is probably my fault, but I do not fully understand the request here. If Philippe is leaving, his emai... [16:44:11] (03PS1) 10Chad: sudo_check_call: Improve logging on failures [tools/scap] - 10https://gerrit.wikimedia.org/r/238786 [16:44:14] 6operations, 6Phabricator, 7Database, 5Patch-For-Review: Phabricator creates MySQL connection spikes - https://phabricator.wikimedia.org/T109279#1645522 (10mmodell) 5Open>3Resolved Closing per @jcrespo's comment above. 
[16:44:39] !log oblivian@tin Synchronized wmf-config/CommonSettings.php: use ffmpeg whereever possible (duration: 00m 12s) [16:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:44:54] <_joe_> brion: ^^ a test might be a good idea [16:45:00] \o/ ok lemme test [16:45:02] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, 5Patch-For-Review: Support VP9 in TMH (Unable to decode) - https://phabricator.wikimedia.org/T55863#1645526 (10Paladox) Ok ok I was testing on commons.wikimedia.org. [16:45:45] ok testing transcodes from vp9 source at https://commons.wikimedia.org/wiki/File:Snowdonia_by_drone.webm [16:45:46] ... [16:46:15] '[matroska,webm @ 0x24067a0] Unknown/unsupported CodecID V_VP9.' [16:46:16] hrm [16:46:29] _joe_: do we still have to switch which machines are in rotation? [16:46:31] <_joe_> brion: no I meant that we didn't break something [16:46:37] <_joe_> and yes, that [16:46:41] ok :D [16:46:52] <_joe_> if you want to test that vp9 works, lemme work in beta for a sec [16:47:19] should be ok in beta already [16:47:19] (03CR) 10Chad: [C: 032] Rename and simplify some git deploy functions [tools/scap] - 10https://gerrit.wikimedia.org/r/236241 (https://phabricator.wikimedia.org/T109514) (owner: 10Dduvall) [16:47:38] <_joe_> brion: ok [16:47:56] <_joe_> so I can bring the HHVM videoscaler up for a bit [16:48:05] <_joe_> it's 1/3 of course [16:48:13] <_joe_> but I can smoketest it [16:48:32] perfect :D thanks! [16:48:48] <_joe_> brion: in which log file should I see transcoding problems on production/fluorine? 
[16:48:52] (03PS1) 10Ori.livneh: 3.6.5+dfsg1-1+wm[3..5]: backport of updates for D40473 [debs/hhvm] - 10https://gerrit.wikimedia.org/r/238787 [16:48:58] _joe_: ^ [16:49:03] <_joe_> ori: <3 [16:49:16] <_joe_> ori: see, I'm working on the videoscaler to make you happy :) [16:49:19] _joe_: whatever logs job queue output should have that [16:49:22] :D [16:50:04] (03Merged) 10jenkins-bot: Rename and simplify some git deploy functions [tools/scap] - 10https://gerrit.wikimedia.org/r/236241 (https://phabricator.wikimedia.org/T109514) (owner: 10Dduvall) [16:50:07] (03Merged) 10jenkins-bot: Execute distinct stages of deployment separately [tools/scap] - 10https://gerrit.wikimedia.org/r/238631 (https://phabricator.wikimedia.org/T109861) (owner: 10Dduvall) [16:50:20] (03PS1) 10Rush: elastic: codfw eligible master in row a/b/c [puppet] - 10https://gerrit.wikimedia.org/r/238788 [16:50:46] (03PS2) 10Rush: elastic: codfw eligible master in row a/b/c [puppet] - 10https://gerrit.wikimedia.org/r/238788 [16:51:29] _joe_: whatever logs job queue output should have that [16:51:51] you know a piece of infrastructure is in trouble when even brion isn't sure how it works [16:51:57] hehe [16:52:21] <_joe_> brion: btw, the scaler is in rotation, but no transcoding seems to be happening right now [16:52:24] <_joe_> is that possible? [16:52:27] ok lemme run one [16:52:41] _joe_: transcoding is 'spiky', they happen intermittently [16:53:06] ok i just fired off a couple reset transcode jobs [16:53:07] <_joe_> brion: so I could let's say stop the jobrunner on all hosts but the hhvm one for your tests [16:53:26] yeah that's probably easiest way to test [16:54:43] <_joe_> !log turned on the hhvm tmh, stopping the zend ones for testing [16:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [16:55:18] <_joe_> brion: done [16:55:31] <_joe_> if you submit a transcoding job now it should go to hhvm in theory [16:55:44] "Started 26 seconds ago. 
" that's a good sign [16:55:53] as opposed to dying immediately \o/ :DD [16:56:24] <_joe_> brion: 125148 12856 R 203.3 1.0 1:16.67 ffmpeg [16:56:31] woohoooooo [16:56:38] <_joe_> it is working indeeed, let's see the result [16:56:49] an ogv from a vp9 source: https://upload.wikimedia.org/wikipedia/commons/transcoded/9/90/Snowdonia_by_drone.webm/Snowdonia_by_drone.webm.240p.ogv [16:56:50] \o/ [16:57:03] \o/ \o/ \o/ [16:57:04] that [16:57:05] is [16:57:06] fucking [16:57:08] awesome [16:57:15] good job guys!!11111 [16:57:17] i love it when a plan comes together :D [16:57:18] * ori literally bouncing [16:57:21] <_joe_> ori: [16:57:32] <_joe_> ls /etc/wikimedia-image-scaler :P [16:57:42] _joe_: i know it's late for you -- we're probably safe to leave just one in plae overnight and re-image the others when you have time [16:57:44] <_joe_> it's there on the tmh hosts too [16:57:55] brion: glad it still works with the test files :D [16:58:14] <_joe_> brion: can you reencode a random video maybe? [16:58:29] and a webm vp8 from a vp9 source: https://upload.wikimedia.org/wikipedia/commons/transcoded/9/90/Snowdonia_by_drone.webm/Snowdonia_by_drone.webm.360p.webm [16:58:54] (03PS7) 10Chad: Allow top-level logger to track lower level git operations [tools/scap] - 10https://gerrit.wikimedia.org/r/238777 [16:58:57] (03PS2) 10Chad: Use HHVM's logger in its sudo_check_call [tools/scap] - 10https://gerrit.wikimedia.org/r/238782 [16:59:00] (03PS2) 10Chad: sudo_check_call: Improve logging on failures [tools/scap] - 10https://gerrit.wikimedia.org/r/238786 [16:59:01] 6operations, 10Analytics: Moving analysis data from flourine to analytics cluster - https://phabricator.wikimedia.org/T112744#1645588 (10Ottomata) Yes, we can do this. fluorine already has an rsyncd running that allows stat1002 to copy files. This would just be a matter of adding a cron job to rsync them to... 
[16:59:06] (03CR) 10Rush: [C: 032] elastic: codfw eligible master in row a/b/c [puppet] - 10https://gerrit.wikimedia.org/r/238788 (owner: 10Rush) [16:59:26] 6operations, 6Discovery, 5codfw-rollout: Set up a CirrusSearch cluster in codfw (Dallas, Texas) - https://phabricator.wikimedia.org/T105703#1645589 (10chasemp) [16:59:29] (03CR) 10Mobrovac: "As this patchset is changing the same files as https://gerrit.wikimedia.org/r/#/c/238431/ it'd be prudent to make it depend on it." [puppet] - 10https://gerrit.wikimedia.org/r/238738 (https://phabricator.wikimedia.org/T76494) (owner: 10Eevans) [16:59:42] _joe_: i'm firing off a couple more random transcodes on https://commons.wikimedia.org/wiki/File:Cracking_a_Parmesan_Wheel-9MAR2013.webm [17:00:09] cracking, not hacking [17:00:17] she's black hat [17:00:20] (03CR) 10Mobrovac: "Errr, actually make it dependent on https://gerrit.wikimedia.org/r/#/c/238432/ :)" [puppet] - 10https://gerrit.wikimedia.org/r/238738 (https://phabricator.wikimedia.org/T76494) (owner: 10Eevans) [17:00:34] hah [17:01:11] I just saw the netflix Chef's Table episode about https://en.wikipedia.org/wiki/Massimo_Bottura a few days ago [17:01:27] there was a lot of parmesan love in that heh, it's really making me crave the stuff now [17:01:53] https://en.wikipedia.org/wiki/Casu_marzu [17:02:50] 6operations, 10Analytics: Moving analysis data from flourine to analytics cluster - https://phabricator.wikimedia.org/T112744#1645597 (10Addshore) I'm guessing we don't want to rsync the archived logs files themselves (as that is basically 800GB of duplicated data) or does 800GB not matter? And if we just set... [17:03:59] <_joe_> parmesan love? 
[17:04:19] <_joe_> bblack: massimo bottura is a great chef indeed [17:05:09] Kraft Parmigiano-Cellulosis [17:05:47] <_joe_> mutante: don't tell me nothing more [17:05:56] yeah it seemed like half of his show was basically about parmesan [17:06:04] * _joe_ is parmesan :) [17:06:10] heh [17:06:30] (03PS2) 10DCausse: Cirrus: set /langdetect/short-text/ the default langdetect profile [puppet] - 10https://gerrit.wikimedia.org/r/234297 (https://phabricator.wikimedia.org/T110077) [17:07:15] 6operations, 10RESTBase, 10RESTBase-Cassandra: use non-default credentials when authenticating to Cassandra - https://phabricator.wikimedia.org/T92590#1645617 (10Eevans) To summarize a discussion with @fgiunchedi: We will have puppet: * Write a `cqlshrc` file to `/etc/cassandra` that includes the superuser... [17:07:15] <_joe_> brion: so... do you want to live it like it is now? [17:07:30] <_joe_> I do agree but maybe we should announce the switch? [17:07:44] yeah :D [17:07:46] i [17:07:50] 'll drop a quick note on lists [17:10:35] (03PS7) 10DCausse: Disable dynamic scripting in Elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/224651 (owner: 10Manybubbles) [17:10:38] 6operations, 10ops-codfw: solve mtp panel issue for row uplinks - https://phabricator.wikimedia.org/T112774#1645632 (10RobH) 3NEW a:3RobH [17:10:38] <_joe_> ok I'll bake a change to make the current condition puppet-persistent [17:10:47] 6operations, 10ops-codfw: solve mtp panel issue for row uplinks - https://phabricator.wikimedia.org/T112774#1645641 (10RobH) this covers old rt ticket https://rt.wikimedia.org/Ticket/Display.html?id=8158 [17:12:26] (03CR) 10DCausse: [C: 04-1] "This will be merged and deployed tomorrow (Thu 17 Sept at 7am UTC)." 
[puppet] - 10https://gerrit.wikimedia.org/r/224651 (owner: 10Manybubbles) [17:15:14] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, 5Patch-For-Review: Support VP9 in TMH (Unable to decode) - https://phabricator.wikimedia.org/T55863#1645669 (10brion) Ok these should now be working on Commons. @Paladox can you confirm the one you were testing is working now? [17:15:28] 6operations, 10ops-codfw: EQDFW/EQORD Deployment Prep Task - https://phabricator.wikimedia.org/T91077#1645678 (10RobH) 5Open>3Resolved At this point both sites have been deployed (though we are pending the cross connection final patch in our rack for ch2 ntt) the prep part of all of this is resolved. [17:15:59] (03CR) 10Filippo Giunchedi: "good point Marko, I'll disable puppet, merge this and reenable and rebase the statsd one since that's strictly not a blocker for codfw" [puppet] - 10https://gerrit.wikimedia.org/r/238738 (https://phabricator.wikimedia.org/T76494) (owner: 10Eevans) [17:16:46] (03CR) 10GWicke: [C: 04-1] "I really think it would be cleaner to call this something more general like 'cluster', 'cluster_name' or the like, and use it for logstash" [puppet] - 10https://gerrit.wikimedia.org/r/238431 (https://phabricator.wikimedia.org/T112644) (owner: 10Filippo Giunchedi) [17:18:28] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1645798 (10Jgreen) [17:19:22] 6operations: mystery palladium unaccepted salt keys - https://phabricator.wikimedia.org/T112767#1645803 (10Dzahn) etcd1001 and etcd1002 deleted after checking with _joe_ [17:19:39] 6operations, 10Analytics-Cluster, 5Patch-For-Review: Turn off webrequest udp2log instances. - https://phabricator.wikimedia.org/T97294#1645804 (10Jgreen) [17:20:41] ottomata: kafka1022, has that been reinstalled lately? should the salt-key be accepted or deleted? 
[17:20:46] (03PS2) 10Filippo Giunchedi: configure datacenter set [puppet] - 10https://gerrit.wikimedia.org/r/238738 (https://phabricator.wikimedia.org/T76494) (owner: 10Eevans) [17:20:53] (03CR) 10Filippo Giunchedi: [C: 032 V: 032] configure datacenter set [puppet] - 10https://gerrit.wikimedia.org/r/238738 (https://phabricator.wikimedia.org/T76494) (owner: 10Eevans) [17:24:01] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, 5Patch-For-Review: Support VP9 in TMH (Unable to decode) - https://phabricator.wikimedia.org/T55863#1645832 (10Paladox) @brion seems to be transcoding now. only format not shown is vp9. Should be enabled in timedmediahandler now. [17:24:10] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1645833 (10Jgreen) The new kafkatee-based banner log pipeline is up and running on americium.frack.eqiad.wmnet, w... [17:25:51] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, 5Patch-For-Review: Support VP9 in TMH (Unable to decode) - https://phabricator.wikimedia.org/T55863#1645847 (10Paladox) Can 4k be switched on at Wikimedia since vp9 is now supported. [17:26:03] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, 5Patch-For-Review: Support VP9 in TMH (Unable to decode) - https://phabricator.wikimedia.org/T55863#1645848 (10brion) 5Open>3Resolved a:3brion Great, that'll be tracked over on T63805. Closing this one out. :D [17:26:05] Are there any known Gerrit issues at the moment? It doesn't really want to load for me in Firefox or Chrome. The webinterfaces of Ganglia and Mailman load for me, but also those load very, very slow for me [17:26:36] I do not experience any slowness on the English Wikipedia, for example. 
[17:26:39] 6operations: Change Google Webmaster password for noc@ - https://phabricator.wikimedia.org/T110951#1645862 (10Dzahn) a:3Dzahn @Jalexander How about this: You reset the password and then save it in a text file in your home directory on a WMF server,owned by just you. Then you just let me know where and i take... [17:26:44] !log stop puppet on restbase* to apply https://gerrit.wikimedia.org/r/#/c/238738/ / merge / reenable puppet [17:26:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:27:00] (03PS3) 10Andrew Bogott: Configure nf_conntrack hash table size and install conntrack check via NRPE [puppet] - 10https://gerrit.wikimedia.org/r/238712 (owner: 10Muehlenhoff) [17:27:05] (03PS1) 10Giuseppe Lavagetto: videoscaler: add previously unpuppetized file [puppet] - 10https://gerrit.wikimedia.org/r/238793 [17:27:07] (03PS1) 10Giuseppe Lavagetto: videoscaler: make the HHVM videoscaler the only one running [puppet] - 10https://gerrit.wikimedia.org/r/238794 (https://phabricator.wikimedia.org/T104747) [17:27:20] <_joe_> brion: care to +1 the second one? 
[17:27:58] (03CR) 10Brion VIBBER: [C: 031] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/238794 (https://phabricator.wikimedia.org/T104747) (owner: 10Giuseppe Lavagetto) [17:28:00] (03PS2) 10Giuseppe Lavagetto: videoscaler: add previously unpuppetized file [puppet] - 10https://gerrit.wikimedia.org/r/238793 [17:28:03] (03CR) 10Andrew Bogott: [C: 032] Configure nf_conntrack hash table size and install conntrack check via NRPE [puppet] - 10https://gerrit.wikimedia.org/r/238712 (owner: 10Muehlenhoff) [17:28:08] (03CR) 10Giuseppe Lavagetto: [C: 032] videoscaler: add previously unpuppetized file [puppet] - 10https://gerrit.wikimedia.org/r/238793 (owner: 10Giuseppe Lavagetto) [17:28:36] (03PS3) 10Giuseppe Lavagetto: videoscaler: add previously unpuppetized file [puppet] - 10https://gerrit.wikimedia.org/r/238793 [17:28:43] (03CR) 10Giuseppe Lavagetto: [V: 032] videoscaler: add previously unpuppetized file [puppet] - 10https://gerrit.wikimedia.org/r/238793 (owner: 10Giuseppe Lavagetto) [17:29:08] 6operations, 10ops-eqiad, 10hardware-requests: disk degausser for eqiad - https://phabricator.wikimedia.org/T112780#1645879 (10RobH) 3NEW a:3RobH [17:29:52] (03PS2) 10Giuseppe Lavagetto: videoscaler: make the HHVM videoscaler the only one running [puppet] - 10https://gerrit.wikimedia.org/r/238794 (https://phabricator.wikimedia.org/T104747) [17:30:17] 6operations, 10ops-eqiad, 10netops: asw-d-eqiad SNMP failures - https://phabricator.wikimedia.org/T112781#1645892 (10faidon) 3NEW [17:30:21] (03CR) 10Giuseppe Lavagetto: [C: 032 V: 032] videoscaler: make the HHVM videoscaler the only one running [puppet] - 10https://gerrit.wikimedia.org/r/238794 (https://phabricator.wikimedia.org/T104747) (owner: 10Giuseppe Lavagetto) [17:30:38] 6operations, 10ops-eqiad, 10netops: asw-d-eqiad SNMP failures - https://phabricator.wikimedia.org/T112781#1645901 (10faidon) [17:30:39] 6operations, 10ops-eqiad, 10Traffic, 10netops, 5Patch-For-Review: rack/setup new eqiad lvs machines - 
https://phabricator.wikimedia.org/T104458#1645900 (10faidon) [17:31:59] (03PS1) 10BBlack: lvs netboot host regex fixups [puppet] - 10https://gerrit.wikimedia.org/r/238797 [17:32:19] (03CR) 10BBlack: [C: 032 V: 032] lvs netboot host regex fixups [puppet] - 10https://gerrit.wikimedia.org/r/238797 (owner: 10BBlack) [17:34:09] 6operations: mystery palladium unaccepted salt keys - https://phabricator.wikimedia.org/T112767#1645911 (10Krenair) [17:34:27] !log asw-d-eqiad: toggling RE mastership again [17:34:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [17:34:47] 6operations, 10ops-codfw: wipe working spare disk in codfw - https://phabricator.wikimedia.org/T112783#1645923 (10RobH) 3NEW a:3mark [17:36:09] (03PS1) 10Addshore: Rsync api log archives from fluorine to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/238798 (https://bugzilla.wikimedia.org/112744) [17:36:36] 6operations, 10MediaWiki-extensions-TimedMediaHandler, 6Multimedia, 7HHVM, 5Patch-For-Review: Convert tmh100[12] to HHVM and trusty - https://phabricator.wikimedia.org/T104747#1645937 (10Joe) As of now, all tmh jobs are being handled by the newer trusty-based host. To change this it's sufficient to reve... [17:36:53] <_joe_> ok, now I can call it a day :) [17:37:13] 6operations, 10Analytics: Moving analysis data from flourine to analytics cluster - https://phabricator.wikimedia.org/T112744#1645938 (10Legoktm) [17:37:13] (03PS2) 10Addshore: Rsync api log archives from fluorine to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/238798 (https://phabricator.wikimedia.org/T112744) [17:38:25] 6operations, 10Analytics, 5Patch-For-Review: Moving analysis data from flourine to analytics cluster - https://phabricator.wikimedia.org/T112744#1645942 (10Addshore) This will actually result in roughly 2.4T currently as the retention is 90 days on stat1002 [17:38:42] accepted! 
[17:39:01] did, sorry, thanks mutante [17:39:37] ottomata: cool, was just for a ticket about cleaning them all [17:40:08] PROBLEM - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:40:08] PROBLEM - Disk space on analytics1026 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:40:18] PROBLEM - Disk space on analytics1027 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [17:40:18] PROBLEM - Hadoop Namenode - Primary on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode [17:40:45] <_joe_> ottomata: uhm some analytics failures I see [17:40:51] <_joe_> heavy load maybe [17:41:05] uhhhhh [17:41:09] those ones! ^^^ [17:41:09] ? [17:41:16] none of those nodes have much load [17:41:18] looking [17:41:26] oh 1001 [17:41:27] uhh [17:42:03] 1001 in standby whaa [17:42:10] RECOVERY - Hadoop Namenode - Primary on analytics1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode [17:42:19] eek hm [17:42:22] ... [17:42:25] i didn't do that [17:42:29] RECOVERY - Disk space on labstore1002 is OK: DISK OK [17:42:36] still in standby, looks like the namenode restarted.. [17:42:58] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 13.33% of data above the critical threshold [500.0] [17:45:29] RECOVERY - Disk space on analytics1026 is OK: DISK OK [17:45:29] RECOVERY - Disk space on stat1002 is OK: DISK OK [17:45:38] RECOVERY - Disk space on analytics1027 is OK: DISK OK [17:48:55] 6operations, 10ops-codfw: msw-c1-codfw dead?
- https://phabricator.wikimedia.org/T112786#1646003 (10faidon) 3NEW [17:49:02] chasemp: ^ may affect you [17:49:11] elastic2013-2015 are on that rack I see [17:49:30] ottomata: https://phabricator.wikimedia.org/T111053#1644653 [17:49:33] 6operations, 10ops-codfw: Humidity Alarms from codfw - https://phabricator.wikimedia.org/T110421#1646018 (10RobH) I've emailed CyrusOne support directly and CC'd Papaul. I didn't include the task, since it may contain private data from support. I'll update this task accordingly. Email to support was as foll... [17:50:07] PROBLEM - puppet last run on analytics1001 is CRITICAL: CRITICAL: Puppet has 1 failures [17:53:28] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [17:55:06] heh, what's happenning? [17:55:18] K thanks paravoid [17:55:39] just broken shit everywhere [17:55:57] puppet is fine! geez [17:56:17] 3 different switches apparently [17:56:30] and a router bug [17:57:08] RECOVERY - puppet last run on analytics1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:57:10] ok wow [17:57:10] yeah [17:57:29] would make sense, namenode died because of loss of connection to journalnodes for a sec [17:57:31] its ok now then [18:00:04] twentyafterfour: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150916T1800). [18:08:17] (03PS1) 10Andrew Bogott: Openstack: Clarify differences between the controller and the spare: [puppet] - 10https://gerrit.wikimedia.org/r/238802 [18:09:33] (03CR) 10Dzahn: "compiler says it does not influence anything on carbon." 
[puppet] - 10https://gerrit.wikimedia.org/r/238525 (https://phabricator.wikimedia.org/T111225) (owner: 10Dzahn) [18:11:54] (03PS2) 10Andrew Bogott: Openstack: Clarify differences between the controller and the spare: [puppet] - 10https://gerrit.wikimedia.org/r/238802 [18:12:27] (03PS2) 10Dzahn: reprepro: switch default_distro to jessie [puppet] - 10https://gerrit.wikimedia.org/r/238525 (https://phabricator.wikimedia.org/T111225) [18:14:30] (03CR) 10Dzahn: [C: 032] "carbon (apt.wm) just has the installserver role, and that does not use the reprepro module.also confirmed with compiler, link above" [puppet] - 10https://gerrit.wikimedia.org/r/238525 (https://phabricator.wikimedia.org/T111225) (owner: 10Dzahn) [18:14:55] (03PS3) 10Andrew Bogott: Openstack: Clarify differences between the controller and the spare: [puppet] - 10https://gerrit.wikimedia.org/r/238802 [18:16:32] (03PS6) 10Milimetric: Add an Analytics specific instance of RESTBase [puppet] - 10https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) [18:16:45] (03CR) 10Milimetric: Add an Analytics specific instance of RESTBase (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/231574 (https://phabricator.wikimedia.org/T107056) (owner: 10Milimetric) [18:18:21] (03PS4) 10Andrew Bogott: Openstack: Clarify differences between the controller and the spare: [puppet] - 10https://gerrit.wikimedia.org/r/238802 [18:18:33] 6operations, 5Patch-For-Review: Change distribution in releases.wikimedia.org to "sid" or "jessie" - https://phabricator.wikimedia.org/T111225#1646156 (10Dzahn) @cscott @gwicke I also merged that change now. The default has been switched. Wanna try uploading a new release? 
[18:19:58] (03PS5) 10Andrew Bogott: Openstack: Clarify differences between the controller and the spare: [puppet] - 10https://gerrit.wikimedia.org/r/238802 [18:20:58] (03PS8) 10Rush: Disable dynamic scripting in Elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/224651 (owner: 10Manybubbles) [18:21:11] (03CR) 10Andrew Bogott: [C: 032] Openstack: Clarify differences between the controller and the spare: [puppet] - 10https://gerrit.wikimedia.org/r/238802 (owner: 10Andrew Bogott) [18:23:54] 6operations, 5Patch-For-Review: Change distribution in releases.wikimedia.org to "sid" or "jessie" - https://phabricator.wikimedia.org/T111225#1646176 (10Dzahn) a:3Dzahn [18:24:02] (03PS1) 10Andrew Bogott: Openstack: s/hostname/fqdn/g in the controller vs. spare logic [puppet] - 10https://gerrit.wikimedia.org/r/238805 [18:24:10] (03CR) 10jenkins-bot: [V: 04-1] Openstack: s/hostname/fqdn/g in the controller vs. spare logic [puppet] - 10https://gerrit.wikimedia.org/r/238805 (owner: 10Andrew Bogott) [18:24:22] (03PS2) 10Andrew Bogott: Openstack: s/hostname/fqdn/g in the controller vs. spare logic [puppet] - 10https://gerrit.wikimedia.org/r/238805 [18:25:26] (03CR) 10Andrew Bogott: [C: 032] Openstack: s/hostname/fqdn/g in the controller vs. spare logic [puppet] - 10https://gerrit.wikimedia.org/r/238805 (owner: 10Andrew Bogott) [18:25:46] ok I'm going to deploy the train if there are no objections [18:27:20] 6operations: mystery palladium unaccepted salt keys - https://phabricator.wikimedia.org/T112767#1646210 (10Dzahn) kafka1022 fixed by ottomata [18:27:41] 6operations: mystery palladium unaccepted salt keys - https://phabricator.wikimedia.org/T112767#1646241 (10Dzahn) [18:29:36] (03PS4) 1020after4: SSH repo hosting support for phabricator. 
[puppet] - 10https://gerrit.wikimedia.org/r/237096 (https://phabricator.wikimedia.org/T128) [18:31:21] 6operations: mystery palladium unaccepted salt keys - https://phabricator.wikimedia.org/T112767#1646292 (10Krenair) rbf2002.codfw.wmnet - sounds like this can be deleted per {T95153} [18:32:23] (03PS1) 10BBlack: disable lvs::balancer on new eqiad LVS for now [puppet] - 10https://gerrit.wikimedia.org/r/238807 (https://phabricator.wikimedia.org/T104458) [18:32:36] (03CR) 10BBlack: [C: 032 V: 032] disable lvs::balancer on new eqiad LVS for now [puppet] - 10https://gerrit.wikimedia.org/r/238807 (https://phabricator.wikimedia.org/T104458) (owner: 10BBlack) [18:33:05] we need to add eevans to the 'services' group in gerrit, but don't have the rights to do so ourselves: https://gerrit.wikimedia.org/r/#/admin/groups/630,members [18:33:19] any gerrit admins around ^^ ? [18:33:39] (03CR) 1020after4: [C: 032] Use HHVM's logger in its sudo_check_call [tools/scap] - 10https://gerrit.wikimedia.org/r/238782 (owner: 10Chad) [18:34:03] (03Merged) 10jenkins-bot: Use HHVM's logger in its sudo_check_call [tools/scap] - 10https://gerrit.wikimedia.org/r/238782 (owner: 10Chad) [18:34:16] (03PS1) 10BBlack: include standard temporarily for new eqiad LVS [puppet] - 10https://gerrit.wikimedia.org/r/238811 (https://phabricator.wikimedia.org/T104458) [18:34:26] (03CR) 10BBlack: [C: 032 V: 032] include standard temporarily for new eqiad LVS [puppet] - 10https://gerrit.wikimedia.org/r/238811 (https://phabricator.wikimedia.org/T104458) (owner: 10BBlack) [18:34:28] (03PS1) 10Andrew Bogott: OpenStack: Turn off keystone service and cron on spare controller. 
[puppet] - 10https://gerrit.wikimedia.org/r/238812 [18:35:03] PROBLEM - Tool Labs instance distribution on labcontrol1002 is CRITICAL: CRITICAL: master class instances not spread out enough [18:35:10] (03CR) 1020after4: [C: 031] sudo_check_call: Improve logging on failures [tools/scap] - 10https://gerrit.wikimedia.org/r/238786 (owner: 10Chad) [18:35:48] (03PS2) 10Andrew Bogott: OpenStack: Turn off keystone service and cron on spare controller. [puppet] - 10https://gerrit.wikimedia.org/r/238812 [18:36:24] 6operations, 10ops-eqiad, 10netops: asw-d-eqiad SNMP failures - https://phabricator.wikimedia.org/T112781#1646330 (10faidon) I can say pretty conclusively that the removal of those SFP+ fixed this issue: LibreNMS has been collecting metrics just fine for about an hour. As to how to move forward: first off,... [18:37:02] (03CR) 10Andrew Bogott: [C: 032] OpenStack: Turn off keystone service and cron on spare controller. [puppet] - 10https://gerrit.wikimedia.org/r/238812 (owner: 10Andrew Bogott) [18:37:39] 6operations: mystery palladium unaccepted salt keys - https://phabricator.wikimedia.org/T112767#1646341 (10Dzahn) rbf2002 - deleted, because [[ https://wikitech.wikimedia.org/wiki/Infrastructure_naming_conventions | naming conventions ]] says rbf is unused. And there is also no rbf host in site.pp [18:38:10] 6operations: mystery palladium unaccepted salt keys - https://phabricator.wikimedia.org/T112767#1646344 (10Dzahn) [18:38:35] !log twentyafterfour@tin Synchronized php-1.26wmf23: syncing wmf23 ahead of deployment to group1 (duration: 01m 35s) [18:38:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:41:17] gwicke: I think I can do it [18:41:40] done [18:41:42] urandom: ^ [18:41:53] paravoid: thanks! [18:42:11] I'm confused on when we use LDAP groups (which Gerrit supports) and when Gerrit native groups honestly [18:42:20] but this was way too easy, so whatever :P [18:42:35] 503s on test.wp.org [18:42:57] due to deploy ? 
[18:43:23] (03PS8) 10Greg Grossmeier: Allow top-level logger to track lower level git operations [tools/scap] - 10https://gerrit.wikimedia.org/r/238777 (https://phabricator.wikimedia.org/T109858) (owner: 10Chad) [18:43:59] (03PS3) 10Greg Grossmeier: sudo_check_call: Improve logging on failures [tools/scap] - 10https://gerrit.wikimedia.org/r/238786 (https://phabricator.wikimedia.org/T109858) (owner: 10Chad) [18:46:41] (03PS1) 1020after4: group1 wikis to 1.26wmf23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238816 [18:46:59] (03CR) 1020after4: [C: 032] group1 wikis to 1.26wmf23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238816 (owner: 1020after4) [18:47:05] (03Merged) 10jenkins-bot: group1 wikis to 1.26wmf23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238816 (owner: 1020after4) [18:47:24] !log twentyafterfour@tin rebuilt wikiversions.cdb and synchronized wikiversions files: group1 wikis to 1.26wmf23 [18:47:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [18:50:23] thedj: I don't know. looking [18:50:55] thedj: what url 503s? [18:52:25] https://test.wikipedia.org/wiki/Page724 [18:52:27] seems gone now [18:54:09] 6operations, 10ops-eqiad, 10Traffic, 10netops, 5Patch-For-Review: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1646429 (10BBlack) lvs1012 is now somewhat-up and somewhat-puppetized, and the 1GB connection to Row B seems to work correctly. I still need to go through all t... [18:54:40] (03PS1) 10Papaul: Add DNS entries for restbase200[1-6] Bug:T112683 [dns] - 10https://gerrit.wikimedia.org/r/238818 (https://phabricator.wikimedia.org/T112683) [18:59:19] (03CR) 10Smalyshev: "Any other comments/objections/suggestions? If not, I'll submit it to Puppet SWAT." [puppet] - 10https://gerrit.wikimedia.org/r/230483 (https://phabricator.wikimedia.org/T97195) (owner: 10Smalyshev) [19:00:19] (03CR) 10Chad: SSH repo hosting support for phabricator. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/237096 (https://phabricator.wikimedia.org/T128) (owner: 1020after4) [19:01:04] (03CR) 10Chad: "Bump. What's the status here?" [tools/scap] - 10https://gerrit.wikimedia.org/r/224629 (owner: 10Ori.livneh) [19:03:34] (03CR) 10Rush: SSH repo hosting support for phabricator. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/237096 (https://phabricator.wikimedia.org/T128) (owner: 1020after4) [19:04:48] (03CR) 10Rush: [C: 031] "seems reasonable to me" [puppet] - 10https://gerrit.wikimedia.org/r/224651 (owner: 10Manybubbles) [19:06:20] 6operations, 10ops-codfw: setup/install/deploy new HP restbase servers for codfw - https://phabricator.wikimedia.org/T112683#1646487 (10Papaul) Servers racking complete Rack table updated Asset tag in place physical label in place BIOS settings updated restbase2001 : b5 ge-5/0/29 restbase2002 : b8 ge-8/0/8... [19:06:37] PROBLEM - configured eth on lvs1012 is CRITICAL: eth3 reporting no carrier. [19:09:43] (03Abandoned) 10Ori.livneh: Expect l10n_cache-en.php, not l10n_cache-en.cdb [tools/scap] - 10https://gerrit.wikimedia.org/r/224629 (owner: 10Ori.livneh) [19:09:52] there are a whole lot of errors like "Sep 16 19:06:06 mw1116: #012Warning: Unknown modifier '\': [([^\s,]+)\s*=\s*([^\s,]+)[\+\-]]" in the logs... [19:10:25] (03CR) 10Chad: [C: 032 V: 032] Add jar for BouncyCastle 1.44 from Debian wheezy [gerrit/plugins] - 10https://gerrit.wikimedia.org/r/237918 (https://phabricator.wikimedia.org/T112025) (owner: 10QChris) [19:10:48] 6operations, 10ops-codfw: msw-c1-codfw dead? - https://phabricator.wikimedia.org/T112786#1646509 (10Papaul) Yes that is why i was having problem with restbase2003 I couldn't ping it. [19:11:13] ori: Thx for update on that patch. 
Just trying to keep our backlog of unreviewed patches moving :) [19:11:30] (for scap, heh) [19:13:55] 6operations, 10Wikimedia-Git-or-Gerrit, 5Patch-For-Review: Wikimedia Gerrit doesn't work if OpenSSH version is higher than 7.0 - https://phabricator.wikimedia.org/T112025#1646515 (10Paladox) Does this mean we doint need the workaround now since the patch was merged. [19:14:35] (03PS1) 10Chad: Avoid race condition where lock file disappeared from under us [tools/scap] - 10https://gerrit.wikimedia.org/r/238828 [19:15:07] ostriches: np, sorry for letting it stall. [19:16:13] 6operations, 10Datasets-General-or-Unknown, 7HHVM: Convert snapshot hosts to use HHVM and trusty - https://phabricator.wikimedia.org/T94277#1646517 (10ori) >>! In T94277#1615702, @ArielGlenn wrote: > So expect action on this around September 15, earlier if the run of the dumps finishes earlier. September 1... [19:16:20] (03CR) 10Chad: "Saw this manifest as https://integration.wikimedia.org/ci/job/beta-scap-eqiad/70310/console" [tools/scap] - 10https://gerrit.wikimedia.org/r/238828 (owner: 10Chad) [19:16:22] 6operations, 10ops-codfw: msw-c1-codfw dead? - https://phabricator.wikimedia.org/T112786#1646518 (10Papaul) resetting the switch fix the problem. [19:17:45] 6operations, 10ops-codfw: msw-c1-codfw dead? 
- https://phabricator.wikimedia.org/T112786#1646519 (10faidon) 5Open>3Resolved a:3faidon [19:17:56] (03CR) 1020after4: [C: 032] Avoid race condition where lock file disappeared from under us [tools/scap] - 10https://gerrit.wikimedia.org/r/238828 (owner: 10Chad) [19:18:11] (03Merged) 10jenkins-bot: Avoid race condition where lock file disappeared from under us [tools/scap] - 10https://gerrit.wikimedia.org/r/238828 (owner: 10Chad) [19:18:53] 6operations, 10ops-eqiad, 10Traffic, 10netops, 5Patch-For-Review: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1646522 (10Cmjohnson) Added GigE link to lvs1009...connected asw-b-eqiad ge-8/0/45 with cable number of 3110 [19:19:39] 6operations, 10ops-codfw: setup/install/deploy new HP restbase servers for codfw - https://phabricator.wikimedia.org/T112683#1646523 (10Papaul) I couldn't ping or ssh restbase2003 because the switch had a problem i had to reset the switch now everything is working. [19:22:16] 6operations, 10ops-codfw, 10netops: cr1-eqdfw PEM 0 failure - https://phabricator.wikimedia.org/T110435#1646532 (10Papaul) I have received the replacement part for cr-eqdfw. I will be going on site tomorrow at 10:00am to replace the faulty part. [19:22:27] PROBLEM - puppet last run on analytics1015 is CRITICAL: CRITICAL: puppet fail [19:24:16] RECOVERY - puppet last run on analytics1015 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:26:45] (03PS1) 10Ottomata: Enable error logging for Hive/Oozie meta mysql, set binlog_format=ROW [puppet] - 10https://gerrit.wikimedia.org/r/238832 (https://phabricator.wikimedia.org/T110090) [19:26:57] (03CR) 10Hashar: [C: 04-1] "I find it very horrible to have to pass the logger everywhere. 
The cli.Application class takes care of setting up the root logger which i" [tools/scap] - 10https://gerrit.wikimedia.org/r/238777 (https://phabricator.wikimedia.org/T109858) (owner: 10Chad) [19:27:22] 6operations: Change Google Webmaster password for noc@ - https://phabricator.wikimedia.org/T110951#1646552 (10Jalexander) >>! In T110951#1645862, @Dzahn wrote: > @Jalexander How about this: You reset the password and then save it in a text file in your home directory on a WMF server,owned by just you. Then you... [19:28:41] 6operations, 10ops-codfw: ms-be2006.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T112242#1646562 (10Papaul) a:5Papaul>3fgiunchedi Driver replacement complete [19:28:42] (03CR) 10Jcrespo: [C: 031] Enable error logging for Hive/Oozie meta mysql, set binlog_format=ROW [puppet] - 10https://gerrit.wikimedia.org/r/238832 (https://phabricator.wikimedia.org/T110090) (owner: 10Ottomata) [19:29:03] 6operations, 10ops-codfw: ms-be2006.codfw.wmnet: slot=6 dev=sdg failed - https://phabricator.wikimedia.org/T112242#1646568 (10Papaul) Drive replacement complete [19:29:04] (03CR) 10Ottomata: [C: 032] Enable error logging for Hive/Oozie meta mysql, set binlog_format=ROW [puppet] - 10https://gerrit.wikimedia.org/r/238832 (https://phabricator.wikimedia.org/T110090) (owner: 10Ottomata) [19:33:52] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1646579 (10ellery) I'll respond later with a more detailed report but here are some initial results based on some... [19:36:18] (03PS1) 10Dduvall: Support atomic promotion and rollback [tools/scap] - 10https://gerrit.wikimedia.org/r/238839 (https://phabricator.wikimedia.org/T109514) [19:36:50] (03PS1) 10Andrew Bogott: s/hostname/fqdn/g in the controller vs. 
spare logic, again [puppet] - 10https://gerrit.wikimedia.org/r/238841 [19:38:00] (03CR) 10Andrew Bogott: [C: 032] s/hostname/fqdn/g in the controller vs. spare logic, again [puppet] - 10https://gerrit.wikimedia.org/r/238841 (owner: 10Andrew Bogott) [19:38:35] !log Deployed statsv 0bfd9f06f / change I050a12d3b [19:38:38] ^ Krinkle [19:38:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [19:39:38] ori: Thx [19:42:15] (03CR) 10Dduvall: "Note that to test this patch with scap-vagrant, you'll need to pull down the latest version, nuke your containers (`lxc-ls | while read c;" [tools/scap] - 10https://gerrit.wikimedia.org/r/238839 (https://phabricator.wikimedia.org/T109514) (owner: 10Dduvall) [19:45:38] PROBLEM - puppet last run on mw2112 is CRITICAL: CRITICAL: puppet fail [19:54:32] (03CR) 10RobH: [C: 032] Add DNS entries for restbase200[1-6] Bug:T112683 [dns] - 10https://gerrit.wikimedia.org/r/238818 (https://phabricator.wikimedia.org/T112683) (owner: 10Papaul) [20:00:04] gwicke cscott arlolra subbu mdholloway: Dear anthropoid, the time has come. Please deploy Services – Parsoid / OCG / Citoid / Mobileapps / … (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150916T2000). [20:00:14] no parsoid deploy today. [20:10:36] (03CR) 10Hashar: [C: 04-1] "You are appending a newline while it has been strip() just a line above." 
(032 comments) [tools/scap] - 10https://gerrit.wikimedia.org/r/238786 (https://phabricator.wikimedia.org/T109858) (owner: 10Chad) [20:14:37] RECOVERY - puppet last run on mw2112 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [20:15:32] (03PS1) 10EBernhardson: [elasticsearch] Update recover_after_nodes value [puppet] - 10https://gerrit.wikimedia.org/r/238850 [20:16:34] (03PS2) 10EBernhardson: [elasticsearch] Update recover_after_nodes value [puppet] - 10https://gerrit.wikimedia.org/r/238850 [20:16:54] 6operations, 10ops-codfw: setup/install/deploy new HP restbase servers for codfw - https://phabricator.wikimedia.org/T112683#1646793 (10RobH) @papaul: I've gone ahead and set all the vlan and port descriptions, so you should be ok to continue along and put in the production dns entries. [20:18:09] (03PS3) 10EBernhardson: [elasticsearch] Update recover_after_nodes value [puppet] - 10https://gerrit.wikimedia.org/r/238850 [20:29:41] (03PS1) 10QChris: Update delete-project plugin to the deployed version [gerrit/plugins] - 10https://gerrit.wikimedia.org/r/238857 [20:31:21] (03PS1) 10Chad: Fix logging output from sudo_check_call [tools/scap] - 10https://gerrit.wikimedia.org/r/238858 [20:31:38] (03PS1) 10Papaul: Add MAC address entries for restbase200[1-6] Bug:T112683 [puppet] - 10https://gerrit.wikimedia.org/r/238859 (https://phabricator.wikimedia.org/T112683) [20:33:33] (03PS1) 10Andrew Bogott: Added tests for grid job submission. 
[puppet] - 10https://gerrit.wikimedia.org/r/238863 (https://phabricator.wikimedia.org/T97748) [20:35:38] 6operations, 10ops-eqiad, 10Traffic, 10netops, 5Patch-For-Review: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1646856 (10Cmjohnson) ge-8/0/45 up down <> ge-8/0/46 up up <> [20:36:47] (03CR) 10Chad: [C: 04-2] sudo_check_call: Improve logging on failures [tools/scap] - 10https://gerrit.wikimedia.org/r/238786 (https://phabricator.wikimedia.org/T109858) (owner: 10Chad) [20:36:51] (03CR) 10Chad: [C: 04-2] Allow top-level logger to track lower level git operations [tools/scap] - 10https://gerrit.wikimedia.org/r/238777 (https://phabricator.wikimedia.org/T109858) (owner: 10Chad) [20:36:58] (03PS2) 10Andrew Bogott: Added tests for grid job submission. [puppet] - 10https://gerrit.wikimedia.org/r/238863 (https://phabricator.wikimedia.org/T97748) [20:37:14] (03CR) 10Hashar: [C: 04-1] "Atomic deploy and rollback. That is a killer feature :-}" (033 comments) [tools/scap] - 10https://gerrit.wikimedia.org/r/238839 (https://phabricator.wikimedia.org/T109514) (owner: 10Dduvall) [20:40:28] Is there a Trusty equivalent of terbium for misc maintenance work? [20:40:53] 6operations: mystery palladium unaccepted salt keys - https://phabricator.wikimedia.org/T112767#1646888 (10jcrespo) [20:42:02] 6operations: mystery palladium unaccepted salt keys - https://phabricator.wikimedia.org/T112767#1646899 (10chasemp) 5Open>3Resolved a:3chasemp you guys made quick work of this, nice [20:43:07] 6operations: mystery palladium unaccepted salt keys - https://phabricator.wikimedia.org/T112767#1646909 (10jcrespo) "And I would have gotten away with it, too, if it weren't for you meddling kids!" 
[20:47:48] !log updated OCG to version 4032a596ce6eb442b02cc6ee9b79263b1eb23275 [20:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [20:48:21] 6operations, 10Analytics, 10Analytics-Cluster, 10Fundraising Tech Backlog, 10Fundraising-Backlog: Verify kafkatee use for fundraising logs on erbium - https://phabricator.wikimedia.org/T97676#1646933 (10Ottomata) Ellery, call to your Senator and tell her you want a realtime streaming analytics cluster. [20:51:56] PROBLEM - YARN NodeManager Node-State on analytics1029 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:52:58] PROBLEM - Disk space on analytics1027 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:53:07] PROBLEM - Disk space on stat1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:53:17] PROBLEM - Hadoop Namenode - Primary on analytics1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode [20:53:26] PROBLEM - Disk space on analytics1026 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [20:53:37] RECOVERY - YARN NodeManager Node-State on analytics1029 is OK: OK: YARN NodeManager analytics1029.eqiad.wmnet:8041 Node-State: RUNNING [20:54:57] ottomata: the Hadoop on analytics1001 paged [20:56:03] gah [20:56:09] that one doesnt look like a timeout at the others [20:56:18] (03PS1) 10Chad: Simplify logging in ssh module [tools/scap] - 10https://gerrit.wikimedia.org/r/238959 [20:56:19] s/at/like [20:56:22] yeah, no mutante, there is some network hiccup [20:56:31] that is causing namenode to miss talking to its journalnodes [20:56:34] so it dies [20:56:39] (03PS1) 10Andrew Bogott: Add check for /public/dumps [puppet] - 10https://gerrit.wikimedia.org/r/238960 (https://phabricator.wikimedia.org/T97748) [20:57:05] is the fix just start service hadoop? 
[20:57:15] that, and also it needs promoted to active namenode [20:57:20] since we don't do auto failover [20:57:34] https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hadoop/Administration#Transition_to_Active [20:57:51] ah:) [20:57:53] this is the second time it happened today [20:57:54] and is not [20:57:55] good [20:58:01] which one do we want to be active [20:58:18] 1001 [20:58:38] RECOVERY - Hadoop Namenode - Primary on analytics1001 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.namenode.NameNode [20:58:45] should i do this? ah. ok [20:58:53] just did. sorry, maybe i shoulda let you so some others know [20:58:55] mutante: all i did was [20:58:56] RECOVERY - Disk space on analytics1026 is OK: DISK OK [20:59:03] sudo service hadoop-hdfs-namenode start [20:59:04] more pages weee [20:59:11] (puppet would restart it if we waited long enough) [20:59:18] then wait about a min for namenode to exit safemode [20:59:19] then [20:59:21] sudo -u hdfs /usr/bin/hdfs haadmin -transitionToActive analytics1001-eqiad-wmnet [20:59:32] ok, yep, i was about to paste the same [20:59:47] you can check it with sudo -u hdfs /usr/bin/hdfs haadmin -getServiceState analytics1001-eqiad-wmnet [21:00:03] it says "active" [21:00:08] RECOVERY - Disk space on analytics1027 is OK: DISK OK [21:00:17] RECOVERY - Disk space on stat1002 is OK: DISK OK [21:00:23] so it is already set? 
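(Editor's sketch, not part of the log.) The manual recovery pasted above (start the namenode, wait for it to leave safemode, promote it, check its state) can be written down as an ordered command list. This is a minimal sketch that only builds and prints the commands; it runs nothing. The start, -transitionToActive and -getServiceState commands are the ones pasted in the channel. The `dfsadmin -safemode wait` step is an assumption standing in for "wait about a min", and the function names are hypothetical.

```python
import shlex

HDFS = "/usr/bin/hdfs"  # path as pasted in the channel


def namenode_recovery_commands(service_id):
    """Return the recovery steps, in order, as argv lists (nothing is executed)."""
    as_hdfs = ["sudo", "-u", "hdfs", HDFS]
    return [
        # 1. Start the namenode service (puppet would also do this eventually).
        ["sudo", "service", "hadoop-hdfs-namenode", "start"],
        # 2. Assumption: block until safemode is off instead of sleeping a minute.
        as_hdfs + ["dfsadmin", "-safemode", "wait"],
        # 3. Promote manually, since automatic failover is not configured.
        as_hdfs + ["haadmin", "-transitionToActive", service_id],
        # 4. Verify; the channel reported the answer "active".
        as_hdfs + ["haadmin", "-getServiceState", service_id],
    ]


def render(commands):
    """Render argv lists as shell-quoted lines for review before running by hand."""
    return "\n".join(shlex.join(argv) for argv in commands)


if __name__ == "__main__":
    print(render(namenode_recovery_commands("analytics1001-eqiad-wmnet")))
```

Printing rather than executing keeps a human in the loop, which matches how the steps were coordinated in the channel before anyone ran them.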
[21:00:38] (03Abandoned) 10Chad: Allow top-level logger to track lower level git operations [tools/scap] - 10https://gerrit.wikimedia.org/r/238777 (https://phabricator.wikimedia.org/T109858) (owner: 10Chad) [21:01:09] (03PS1) 10Ottomata: Make sure python-pykafka is installed for eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/238962 [21:01:11] (03PS1) 10Ottomata: Deploy eventlogging code to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/238963 (https://phabricator.wikimedia.org/T112660) [21:01:27] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: Puppet has 1 failures [21:01:32] (03Abandoned) 10Chad: sudo_check_call: Improve logging on failures [tools/scap] - 10https://gerrit.wikimedia.org/r/238786 (https://phabricator.wikimedia.org/T109858) (owner: 10Chad) [21:02:09] ottomata: ok, confirmed, "active" [21:02:19] RFC meeting starting now in #wikimedia-office regarding MySQL [21:02:22] (03CR) 10QChris: [C: 031] Update delete-project plugin to the deployed version [gerrit/plugins] - 10https://gerrit.wikimedia.org/r/238857 (owner: 10QChris) [21:03:18] (03CR) 10Chad: [C: 032 V: 032] Update delete-project plugin to the deployed version [gerrit/plugins] - 10https://gerrit.wikimedia.org/r/238857 (owner: 10QChris) [21:03:28] thanks mutante [21:03:46] PROBLEM - puppet last run on mw2037 is CRITICAL: CRITICAL: Puppet has 1 failures [21:03:47] (03CR) 10Ottomata: [C: 032] Make sure python-pykafka is installed for eventlogging [puppet] - 10https://gerrit.wikimedia.org/r/238962 (owner: 10Ottomata) [21:03:56] (03CR) 10Ottomata: [C: 032] Deploy eventlogging code to stat1002 [puppet] - 10https://gerrit.wikimedia.org/r/238963 (https://phabricator.wikimedia.org/T112660) (owner: 10Ottomata) [21:04:40] 6operations, 10Wikimedia-Site-Requests: Move the wiki of WMEE - https://phabricator.wikimedia.org/T31919#1646997 (10Krenair) 5Open>3stalled a:5Krenair>3None [21:04:42] 6operations, 10Wikimedia-Site-Requests, 7I18n, 7Tracking: Wikis waiting to be renamed 
(tracking) - https://phabricator.wikimedia.org/T21986#1647000 (10Krenair) [21:05:03] (03Abandoned) 10Alex Monk: Create ee.wikimedia.org for renaming from et.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/234426 (https://phabricator.wikimedia.org/T31919) (owner: 10Alex Monk) [21:05:08] (03Abandoned) 10Alex Monk: Add ee.wikimedia.org to apache config for chapters [puppet] - 10https://gerrit.wikimedia.org/r/234427 (https://phabricator.wikimedia.org/T31919) (owner: 10Alex Monk) [21:06:47] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:13:43] (03PS1) 10Ottomata: Make main eventlogging class require eventlogging::package [puppet] - 10https://gerrit.wikimedia.org/r/238967 [21:14:45] (03CR) 10Hashar: [C: 04-1] "Almost! :-}" (032 comments) [tools/scap] - 10https://gerrit.wikimedia.org/r/238858 (owner: 10Chad) [21:15:24] (03CR) 10Ottomata: [C: 032] Make main eventlogging class require eventlogging::package [puppet] - 10https://gerrit.wikimedia.org/r/238967 (owner: 10Ottomata) [21:15:28] PROBLEM - puppet last run on stat1002 is CRITICAL: CRITICAL: puppet fail [21:18:43] (03CR) 10Hashar: [C: 031] "Should be fine." [tools/scap] - 10https://gerrit.wikimedia.org/r/238959 (owner: 10Chad) [21:18:57] RECOVERY - puppet last run on stat1002 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [21:19:42] (03PS2) 10Chad: Fix logging output from sudo_check_call [tools/scap] - 10https://gerrit.wikimedia.org/r/238858 [21:22:07] (03CR) 10Chad: "While it does attach to the root logger, it doesn't attach to the logger I want...the one for our cli.Application. 
Per IRC, we're likely t" [tools/scap] - 10https://gerrit.wikimedia.org/r/238959 (owner: 10Chad) [21:22:10] (03PS1) 10Ottomata: Revert change to always include python-pykafka [puppet] - 10https://gerrit.wikimedia.org/r/238972 [21:22:21] (03CR) 10Ottomata: [C: 032 V: 032] Revert change to always include python-pykafka [puppet] - 10https://gerrit.wikimedia.org/r/238972 (owner: 10Ottomata) [21:23:18] PROBLEM - puppet last run on hafnium is CRITICAL: CRITICAL: Puppet has 1 failures [21:23:19] (03CR) 1020after4: [C: 031] Fix logging output from sudo_check_call [tools/scap] - 10https://gerrit.wikimedia.org/r/238858 (owner: 10Chad) [21:24:27] (03CR) 1020after4: "I will write up the context manager to handle this." [tools/scap] - 10https://gerrit.wikimedia.org/r/238959 (owner: 10Chad) [21:25:07] RECOVERY - puppet last run on hafnium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:26:17] (03PS3) 10QChris: Make gerrit offer newer key exchange algorithms for new sshs [puppet] - 10https://gerrit.wikimedia.org/r/237753 (https://phabricator.wikimedia.org/T112025) [21:26:19] (03PS1) 10QChris: Ensure gerrit's plugins are kept in sync with plugin repo [puppet] - 10https://gerrit.wikimedia.org/r/238976 [21:26:39] (03CR) 10QChris: [C: 04-1] "Before this change can get merged," [puppet] - 10https://gerrit.wikimedia.org/r/238976 (owner: 10QChris) [21:29:45] GET //_cluster/health?pretty=true HTTP/1.1 [21:30:44] are you ok, logmsgbot [21:31:57] RECOVERY - puppet last run on mw2037 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [21:36:18] (03PS1) 10Rush: elastic: update check_elasticsearch.py [puppet] - 10https://gerrit.wikimedia.org/r/238977 [21:38:27] (03PS1) 10Gergő Tisza: Enable authmetrics logging everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/238978 (https://phabricator.wikimedia.org/T91701) [21:39:31] (03PS2) 10Rush: elastic: update check_elasticsearch.py [puppet] - 
10https://gerrit.wikimedia.org/r/238977 [21:40:41] (03CR) 10Dduvall: [C: 031] Fix logging output from sudo_check_call [tools/scap] - 10https://gerrit.wikimedia.org/r/238858 (owner: 10Chad) [21:45:15] (03CR) 10Rush: [C: 032] elastic: update check_elasticsearch.py [puppet] - 10https://gerrit.wikimedia.org/r/238977 (owner: 10Rush) [21:48:16] PROBLEM - DPKG on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:48:26] PROBLEM - Check size of conntrack table on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:48:27] PROBLEM - Disk space on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:48:48] PROBLEM - RAID on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:48:58] PROBLEM - configured eth on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:49:07] PROBLEM - salt-minion processes on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:49:12] 6operations, 10ops-codfw, 10netops: setup wifi in codfw - https://phabricator.wikimedia.org/T86541#1647123 (10RobH) Is there any further movement on getting Papaul setup with wifi access in our cage? When working with him day to day, his IRC client disconnects constantly (mifi access in the cage is spotty). [21:49:37] PROBLEM - dhclient process on mendelevium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [21:57:20] mendelevium is a new VM for OTRS, not in prod [21:59:12] (03PS2) 10Rush: elastic: define codfw lvs [puppet] - 10https://gerrit.wikimedia.org/r/238507 [22:00:29] 6operations, 10Wikimedia-Site-Requests: Move the wiki of WMEE - https://phabricator.wikimedia.org/T31919#1647161 (10Ricordisamoa) [22:02:53] (03CR) 10BBlack: [C: 031] "Looks sane to me" [puppet] - 10https://gerrit.wikimedia.org/r/238507 (owner: 10Rush) [22:05:56] PROBLEM - puppet last run on analytics1029 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[22:06:17] PROBLEM - configured eth on analytics1029 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:06:37] PROBLEM - YARN NodeManager Node-State on analytics1029 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:06:38] PROBLEM - Disk space on analytics1029 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:06:49] 6operations, 10ops-codfw, 10netops: cr1-eqdfw PEM 0 failure - https://phabricator.wikimedia.org/T110435#1647186 (10RobH) Papaul requested the output of a few info gathering commands. I'm not sure how private we need to be about all these serials, so I'm defaulting to paranoia. I setup the paste for just hi... [22:06:57] PROBLEM - dhclient process on analytics1029 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:06:59] PROBLEM - Check size of conntrack table on analytics1029 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:07:00] PROBLEM - salt-minion processes on analytics1029 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:07:00] PROBLEM - RAID on analytics1029 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:07:07] PROBLEM - Disk space on Hadoop worker on analytics1029 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:07:07] PROBLEM - DPKG on analytics1029 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:07:07] PROBLEM - SSH on analytics1029 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:07:26] PROBLEM - Hadoop DataNode on analytics1029 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:07:33] hm [22:07:33] 6operations, 7Mail: Remove Alias for sj@wm.o - https://phabricator.wikimedia.org/T108276#1647193 (10Dzahn) Hello @sj done, i have removed your wikimedia.org email address from the remaining list and then added you with your gmail.com address (you were already member in most), except chaptercommittee-l which y... 
[22:07:37] PROBLEM - Hadoop NodeManager on analytics1029 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [22:08:13] 6operations, 10ops-eqiad, 10Traffic, 10netops, 5Patch-For-Review: rack/setup new eqiad lvs machines - https://phabricator.wikimedia.org/T104458#1647194 (10BBlack) I think, unless @faidon has any objection, we'll run the other 4 row B ports (eth2 of 7, 8, 10, 11) to the 4x 10Gb ports at asw-b xe-5/1/x (b5... [22:08:22] !log powercycling analytcis1029, it is down? [22:08:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:08:53] !log disabling puppet in RESTBase eqiad staging cluster to test new code and config [22:08:56] hm, or maybe its not, and this is just more network weirdness [22:08:57] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 9.09% of data above the critical threshold [500.0] [22:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [22:11:06] PROBLEM - Host analytics1029 is DOWN: PING CRITICAL - Packet loss = 100% [22:11:32] 6operations, 7Mail: Remove Alias for sj@wm.o - https://phabricator.wikimedia.org/T108276#1647203 (10Dzahn) >>! In T108276#1643793, @Sj wrote: > If you can keep an autoresponder up for backwards compatibility, that would be great. This part is something that the OIT team would handle on the Google side of thin... 
[22:11:46] RECOVERY - Disk space on analytics1029 is OK: DISK OK [22:11:47] RECOVERY - YARN NodeManager Node-State on analytics1029 is OK: OK: YARN NodeManager analytics1029.eqiad.wmnet:8041 Node-State: RUNNING [22:11:56] RECOVERY - dhclient process on analytics1029 is OK: PROCS OK: 0 processes with command name dhclient [22:11:56] RECOVERY - Host analytics1029 is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms [22:12:06] RECOVERY - Check size of conntrack table on analytics1029 is OK: OK: nf_conntrack is 0 % full [22:12:07] RECOVERY - salt-minion processes on analytics1029 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [22:12:07] RECOVERY - RAID on analytics1029 is OK: OK: optimal, 13 logical, 14 physical [22:12:07] RECOVERY - Disk space on Hadoop worker on analytics1029 is OK: DISK OK [22:12:07] RECOVERY - DPKG on analytics1029 is OK: All packages OK [22:12:16] RECOVERY - SSH on analytics1029 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.3 (protocol 2.0) [22:12:37] RECOVERY - Hadoop NodeManager on analytics1029 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager [22:12:46] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 23 minutes ago with 0 failures [22:13:07] RECOVERY - configured eth on analytics1029 is OK: OK - interfaces up [22:17:22] 6operations, 10ops-codfw, 10netops: cr1-eqdfw PEM 0 failure - https://phabricator.wikimedia.org/T110435#1647231 (10Papaul) I went to pick up the part in shipping when I realized that the wrong part was sent to me. I did receive the routing engine with no modules and no PEM's. After 2 hours on the phone with... [22:18:07] PROBLEM - puppet last run on analytics1029 is CRITICAL: CRITICAL: Puppet has 1 failures [22:18:48] 6operations, 6Discovery: Fix CirrusSearch monitoring - https://phabricator.wikimedia.org/T84163#1647247 (10Dzahn) >>! In T84163#1387421, @Manybubbles wrote: > None. 
It's still on the list but the team has been concentrating on other > things that have yet to finish. If you want a quick fix I'll +1 disabling >... [22:19:16] 6operations, 10netops: Set up NTT transit @ eqdfw, eqord - https://phabricator.wikimedia.org/T111274#1647249 (10RobH) Presently awaiting completion of the EQ patch for NTT's side. Once Robert @ NTT updates us that it is complete, we'll need to put in a patch request for our panel to our router. We're intenti... [22:19:36] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [22:20:53] 6operations, 6Phabricator: Create an offboarding workflow with HR & Operations - https://phabricator.wikimedia.org/T108131#1647258 (10RobH) [22:20:57] 6operations: audit hr staff and tracking sheet (2015-08-17 revision) against shell access/ldap wmf group - https://phabricator.wikimedia.org/T109382#1647257 (10RobH) 5Open>3Resolved [22:22:13] 6operations, 10Wikimedia-Mailing-lists, 5Patch-For-Review: Education coop mailing list rename - https://phabricator.wikimedia.org/T107445#1647261 (10RobH) a:5RobH>3Selsharbaty-WMF I've reassigned this from myself to @Selsharbaty-WMF as it requires his input. [22:23:07] (03CR) 10Thcipriani: [C: 04-1] "This is awesome." 
(034 comments) [tools/scap] - 10https://gerrit.wikimedia.org/r/238839 (https://phabricator.wikimedia.org/T109514) (owner: 10Dduvall) [22:23:35] 6operations, 10RESTBase, 10hardware-requests: Expand RESTBase cluster capacity - https://phabricator.wikimedia.org/T93790#1647271 (10RobH) 5stalled>3Resolved https://rt.wikimedia.org/Ticket/Display.html?id=9506 has been resolved and this hardware has been racked and is now being setup via T112683 [22:23:47] PROBLEM - Restbase endpoints health on xenon is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=127.0.0.1, port=7231): Max retries exceeded with url: /en.wikipedia.org/v1/?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused))) [22:24:17] 6operations, 6Discovery: Fix CirrusSearch monitoring - https://phabricator.wikimedia.org/T84163#1647277 (10chasemp) I didn't realize this has had a task already. I talked with @ebernhardson about it a bit and there may be some foreshadowing of issues to come. > ebernhardson > i think it might be a sign of wo... [22:25:07] (03PS2) 10Dduvall: Support atomic promotion and rollback [tools/scap] - 10https://gerrit.wikimedia.org/r/238839 (https://phabricator.wikimedia.org/T109514) [22:26:05] (03CR) 10Dduvall: "Rebased" [tools/scap] - 10https://gerrit.wikimedia.org/r/238839 (https://phabricator.wikimedia.org/T109514) (owner: 10Dduvall) [22:27:18] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1647299 (10RobH) So the new S4 space testing so far is working out, except for a few things. * I hate having any kind access based on #projects rather than #acl*project/teamname. As such, I'll be a... 
[22:28:48] (03CR) 10Dduvall: Support atomic promotion and rollback (032 comments) [tools/scap] - 10https://gerrit.wikimedia.org/r/238839 (https://phabricator.wikimedia.org/T109514) (owner: 10Dduvall) [22:30:11] 7Blocked-on-Operations, 6operations, 10Traffic: upload.wikimedia.org still using old 404 error page - https://phabricator.wikimedia.org/T37053#1647315 (10Dzahn) p:5High>3Normal I don't see why it became more important now besides that it's a ticket from 2012. The only bug seems to be that it's ugly look... [22:32:27] PROBLEM - Restbase endpoints health on praseodymium is CRITICAL: /page/mobile-html/{title} is CRITICAL: Test Get MobileApps Main Page returned the unexpected status 504 (expecting: 200): /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 504 (expecting: 200) [22:34:33] 6operations, 6Phabricator: Moving procurement from RT to Phabricator - https://phabricator.wikimedia.org/T93760#1647340 (10RobH) [22:34:37] PROBLEM - Restbase endpoints health on cerium is CRITICAL: /page/mobile-html/{title} is CRITICAL: Test Get MobileApps Main Page returned the unexpected status 504 (expecting: 200): /page/graph/png/{title}/{revision}/{graph_id} is CRITICAL: Test Get a graph from Graphoid returned the unexpected status 504 (expecting: 200) [22:42:13] 6operations, 3Discovery-Maps-Sprint: water_polygons import is broken - https://phabricator.wikimedia.org/T112831#1647374 (10MaxSem) 3NEW [22:44:03] (03CR) 10Yuvipanda: "w00t." [puppet] - 10https://gerrit.wikimedia.org/r/238672 (https://phabricator.wikimedia.org/T63897) (owner: 10Alex Monk) [22:45:24] Oh, I need to test that I guess [22:45:26] Where'd Yuvi run? [22:46:16] Ah, lowercase Y :) [22:46:29] yuvipanda, what TTL do you have in mind? [22:47:00] (03PS1) 10MaxSem: Temporarily disable import_waterlines cronjob [puppet] - 10https://gerrit.wikimedia.org/r/238993 [22:47:22] Krenair: 5mins? [22:47:24] hey, can I get a review of ^^ please? 
[22:47:48] Krenair: 1h? what is it set at now? 24h? [22:48:06] 60*60*24=86400 [22:48:13] yeah [22:48:19] so yes, assuming it's 24h in seconds [22:48:24] 3600 [22:48:26] let's start there [22:48:32] and move down later if need be [22:48:33] k [22:49:47] RECOVERY - puppet last run on analytics1029 is OK: OK: Puppet is currently enabled, last run 56 seconds ago with 0 failures [22:49:56] RECOVERY - Restbase endpoints health on praseodymium is OK: All endpoints are healthy [22:52:57] (03PS2) 10Alex Monk: Move *.labsdb aliases into DNS [puppet] - 10https://gerrit.wikimedia.org/r/238672 (https://phabricator.wikimedia.org/T63897) [22:53:37] yuvipanda, ^ [22:53:56] RECOVERY - Restbase endpoints health on xenon is OK: All endpoints are healthy [22:54:14] (03CR) 10BBlack: "It might be a good idea to generate this from data with less duplication, since there are so many aliases per IP. As in something like: d" [puppet] - 10https://gerrit.wikimedia.org/r/238672 (https://phabricator.wikimedia.org/T63897) (owner: 10Alex Monk) [22:55:47] Krenair: also it's a template without any templating.... should just be a file? [22:55:56] Krenair: or what bblack said and figure out a better way to generate, I guess [22:56:06] hmm [22:56:09] we did that before [22:56:10] let me dig that out [22:56:57] https://gerrit.wikimedia.org/r/#/c/210000/ [22:57:30] (03CR) 10Yuvipanda: "Something like https://gerrit.wikimedia.org/r/#/c/210000/ except generating a zonefile instead of hosts?" [puppet] - 10https://gerrit.wikimedia.org/r/238672 (https://phabricator.wikimedia.org/T63897) (owner: 10Alex Monk) [22:58:26] 6operations, 10Wikimedia-Mailing-lists: Maps-l: Disable or re-assign moderators - https://phabricator.wikimedia.org/T110962#1647424 (10Tfinc) Right now were focusing our conversations using discovery@ and traffic on maps-l is really small. I'd say consolidate it to discovery to get a bigger audience and re-o... 
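(Editor's sketch, not part of the log.) The TTL arithmetic above (60*60*24 = 86400 for a day, 3600 for the agreed hour) and the review suggestion to generate the alias records from data instead of repeating each IP can be illustrated roughly as follows. The alias names, the 192.0.2.x documentation addresses and the function name are made up for illustration; the real patch and its data layout are not shown in the log.

```python
TTL_ONE_DAY = 60 * 60 * 24  # 86400 seconds, the value computed in the channel
TTL_ONE_HOUR = 60 * 60      # 3600 seconds, the agreed starting TTL


def zone_records(aliases_by_ip, ttl=TTL_ONE_HOUR, suffix="labsdb"):
    """Expand {ip: [alias, ...]} into zone-file style A records.

    Each IP is written once in the input data, however many aliases
    point at it, which is the deduplication the review asked about.
    """
    lines = []
    for ip in sorted(aliases_by_ip):
        for alias in aliases_by_ip[ip]:
            lines.append(f"{alias}.{suffix}. {ttl} IN A {ip}")
    return lines


if __name__ == "__main__":
    # Hypothetical data: two database hosts, two aliases each.
    example = {
        "192.0.2.10": ["enwiki", "s1"],
        "192.0.2.11": ["dewiki", "s5"],
    }
    print("\n".join(zone_records(example)))
```

Lowering the TTL later, as suggested in the channel, would then be a one-argument change rather than an edit to every record.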
[22:59:07] RECOVERY - Restbase endpoints health on cerium is OK: All endpoints are healthy [23:00:04] RoanKattouw ostriches rmoen Krenair: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20150916T2300). [23:00:15] oh, I have one I forgot to register [23:02:40] (03PS3) 10Dduvall: Support atomic promotion and rollback [tools/scap] - 10https://gerrit.wikimedia.org/r/238839 (https://phabricator.wikimedia.org/T109514) [23:03:44] (03CR) 10Dduvall: Support atomic promotion and rollback (033 comments) [tools/scap] - 10https://gerrit.wikimedia.org/r/238839 (https://phabricator.wikimedia.org/T109514) (owner: 10Dduvall) [23:12:23] (03PS4) 10Dduvall: Support atomic promotion and rollback [tools/scap] - 10https://gerrit.wikimedia.org/r/238839 (https://phabricator.wikimedia.org/T109514) [23:13:18] (03PS1) 10Eevans: create application users (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/239000 (https://phabricator.wikimedia.org/T92590) [23:13:19] !log started `nodetool rebuild -- eqiad` on restbase-test200{1,2 [23:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:16:47] PROBLEM - OTRS SMTP on mendelevium is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:19:28] !log krenair@tin Synchronized php-1.26wmf23/extensions/MobileFrontend/resources/mobile.overlays/Overlay.less: https://gerrit.wikimedia.org/r/#/c/238865/ (duration: 00m 11s) [23:19:34] (done) [23:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:21:29] (03PS5) 10Dduvall: Support atomic promotion and rollback [tools/scap] - 10https://gerrit.wikimedia.org/r/238839 (https://phabricator.wikimedia.org/T109514) [23:22:07] PROBLEM - SSH on mendelevium is CRITICAL: Server answer [23:22:13] wat? 
[23:23:48] RECOVERY - SSH on mendelevium is OK: SSH OK - OpenSSH_6.7p1 Debian-5 (protocol 2.0) [23:24:11] (03PS6) 10Dduvall: Support atomic promotion and rollback [tools/scap] - 10https://gerrit.wikimedia.org/r/238839 (https://phabricator.wikimedia.org/T109514) [23:25:10] disabled notifications for mendelevium services [23:25:54] (03PS7) 10Dduvall: Support atomic promotion and rollback [tools/scap] - 10https://gerrit.wikimedia.org/r/238839 (https://phabricator.wikimedia.org/T109514) [23:26:31] (03CR) 10Eevans: [C: 04-1] "This probably isn't quite ready yet, but I'm sharing my progress in case Filippo wants to have it before I'm back in front of a keyboard t" [puppet] - 10https://gerrit.wikimedia.org/r/239000 (https://phabricator.wikimedia.org/T92590) (owner: 10Eevans) [23:27:45] (03CR) 10Dduvall: [C: 031] "Implemented Thcipriani's suggestions and fixed a few minor issues." [tools/scap] - 10https://gerrit.wikimedia.org/r/238839 (https://phabricator.wikimedia.org/T109514) (owner: 10Dduvall) [23:27:46] !log updating eqiad switch configs for lvs1007-1012 vlan/trunk settings [23:27:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master [23:28:06] bblack, is it possible now to invalidate kartotherian cache via udp requests? [23:28:24] i remember you said you had to do something before its possible [23:30:23] I don't think we've made it a subtask (maybe should?) or if it's in one of the other tasks, but we need to assign a unique multicast address for cache_maps, and configure it in various places [23:30:48] ACKNOWLEDGEMENT - Tool Labs instance distribution on labcontrol1002 is CRITICAL: CRITICAL: master class instances not spread out enough daniel_zahn https://phabricator.wikimedia.org/T103390 [23:31:17] yuvipanda, want to keep the labs-dnsrecursor.openstack.eqiad.wmflabs instance? [23:31:23] planning to move to a separate one [23:31:39] but I don't know if you want to continue to debug what broke on that one [23:31:40] Krenair: yeah... 
[23:31:55] Krenair: I can't atm, unfortunately... [23:31:56] PROBLEM - Host mw1121 is DOWN: PING CRITICAL - Packet loss = 100% [23:31:56] PROBLEM - Host mw1105 is DOWN: PING CRITICAL - Packet loss = 100% [23:31:56] PROBLEM - Host mw1098 is DOWN: PING CRITICAL - Packet loss = 100% [23:31:56] PROBLEM - Host mw1110 is DOWN: PING CRITICAL - Packet loss = 100% [23:31:56] PROBLEM - Host mw1102 is DOWN: PING CRITICAL - Packet loss = 100% [23:31:57] PROBLEM - Host mw1106 is DOWN: PING CRITICAL - Packet loss = 100% [23:31:57] PROBLEM - Host mw1100 is DOWN: PING CRITICAL - Packet loss = 100% [23:31:58] PROBLEM - Host mw1120 is DOWN: PING CRITICAL - Packet loss = 100% [23:31:58] PROBLEM - Host mw1113 is DOWN: PING CRITICAL - Packet loss = 100% [23:31:59] PROBLEM - Host mw1126 is DOWN: PING CRITICAL - Packet loss = 100% [23:31:59] PROBLEM - Host mw1125 is DOWN: PING CRITICAL - Packet loss = 100% [23:32:00] PROBLEM - Host mw1107 is DOWN: PING CRITICAL - Packet loss = 100% [23:32:00] PROBLEM - Host mw1114 is DOWN: PING CRITICAL - Packet loss = 100% [23:32:01] woaaahh [23:32:01] PROBLEM - Host mw1123 is DOWN: PING CRITICAL - Packet loss = 100% [23:32:16] PROBLEM - Host mw1103 is DOWN: PING CRITICAL - Packet loss = 100% [23:32:24] is that a row going down? [23:32:26] is that you bblack? [23:32:26] PROBLEM - Host mw1097 is DOWN: PING CRITICAL - Packet loss = 100% [23:33:07] yeah, I can't access a sample of these [23:33:14] so it isn't icinga [23:33:27] must be [23:33:52] mw1097.eqiad.wmnet -> B7 [23:34:06] are they all b7? 
[23:34:15] checking a few [23:34:15] I just rolled back all my switch changes on all the switches [23:34:22] should undo whatever it was [23:34:42] I think ops only gives us access to that information for the scap proxies (which is how I found mw1097) [23:34:58] you can tell by subnet what row they're in, it's not really secret [23:35:14] all B7 that i checked, yes [23:35:27] PROBLEM - HTTP 5xx req/min on graphite1001 is CRITICAL: CRITICAL: 8.33% of data above the critical threshold [500.0] [23:36:29] hmmm [23:37:18] yes, it's that one row, not back yet [23:38:08] RECOVERY - Host mw1111 is UP: PING WARNING - Packet loss = 64%, RTA = 361.05 ms [23:38:13] bblack, should i create a task? [23:38:16] RECOVERY - Host mw1118 is UP: PING WARNING - Packet loss = 73%, RTA = 1.82 ms [23:38:16] RECOVERY - Host mw1099 is UP: PING WARNING - Packet loss = 73%, RTA = 16.47 ms [23:38:16] RECOVERY - Host mw1124 is UP: PING WARNING - Packet loss = 80%, RTA = 17.34 ms [23:38:16] RECOVERY - Host mw1115 is UP: PING WARNING - Packet loss = 73%, RTA = 1.03 ms [23:38:16] RECOVERY - Host mw1101 is UP: PING WARNING - Packet loss = 86%, RTA = 0.74 ms [23:38:17] RECOVERY - Host mw1117 is UP: PING WARNING - Packet loss = 73%, RTA = 7.13 ms [23:38:17] RECOVERY - Host mw1104 is UP: PING WARNING - Packet loss = 86%, RTA = 26.56 ms [23:38:18] RECOVERY - Host mw1103 is UP: PING WARNING - Packet loss = 44%, RTA = 0.73 ms [23:38:18] RECOVERY - Host mw1097 is UP: PING OK - Packet loss = 0%, RTA = 0.49 ms [23:38:19] RECOVERY - Host mw1109 is UP: PING OK - Packet loss = 0%, RTA = 1.77 ms [23:38:19] RECOVERY - Host mw1098 is UP: PING OK - Packet loss = 0%, RTA = 0.76 ms [23:38:20] RECOVERY - Host mw1110 is UP: PING OK - Packet loss = 0%, RTA = 2.07 ms [23:38:20] RECOVERY - Host mw1113 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [23:38:21] RECOVERY - Host mw1120 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [23:38:27] :) [23:38:36] RECOVERY - Host mw1127 is UP: PING OK - Packet loss = 0%, RTA = 
0.23 ms [23:38:37] RECOVERY - Host mw1107 is UP: PING OK - Packet loss = 0%, RTA = 1.52 ms [23:38:37] RECOVERY - Host mw1108 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [23:38:49] well it's not even all of that row, but I'm not really sure what it was yet [23:38:54] the switch port lists are a mess :P [23:39:04] ok, i should say all hosts were in that row [23:39:18] from a racktables point of view [23:40:14] will step away from the switches for a bit, and then try to debug exactly why those particular hosts got knocked off after... [23:40:20] 7Blocked-on-Operations, 6operations, 3Discovery-Maps-Sprint: allow maps cluster Varnish cache purging - https://phabricator.wikimedia.org/T112836#1647533 (10Yurik) 3NEW a:3BBlack [23:41:48] PROBLEM - puppet last run on mw1117 is CRITICAL: CRITICAL: puppet fail [23:42:07] PROBLEM - puppet last run on mw1097 is CRITICAL: CRITICAL: puppet fail [23:42:07] PROBLEM - puppet last run on mw1121 is CRITICAL: CRITICAL: puppet fail [23:42:17] PROBLEM - puppet last run on mw1114 is CRITICAL: CRITICAL: puppet fail [23:42:17] PROBLEM - puppet last run on mw1100 is CRITICAL: CRITICAL: puppet fail [23:42:37] PROBLEM - puppet last run on mw1123 is CRITICAL: CRITICAL: puppet fail [23:42:37] PROBLEM - puppet last run on mw1099 is CRITICAL: CRITICAL: puppet fail [23:42:38] PROBLEM - puppet last run on mw1103 is CRITICAL: CRITICAL: puppet fail [23:43:06] PROBLEM - puppet last run on mw1113 is CRITICAL: CRITICAL: puppet fail [23:43:06] PROBLEM - puppet last run on mw1105 is CRITICAL: CRITICAL: puppet fail [23:43:07] PROBLEM - puppet last run on mw1101 is CRITICAL: CRITICAL: puppet fail [23:43:17] PROBLEM - puppet last run on mw1109 is CRITICAL: CRITICAL: puppet fail [23:43:28] PROBLEM - puppet last run on mw1108 is CRITICAL: CRITICAL: puppet fail [23:43:36] PROBLEM - puppet last run on mw1126 is CRITICAL: CRITICAL: puppet fail [23:43:55] puppet failures for the same hosts? 
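(Editor's sketch, not part of the log.) The remark that a host's row can be told from its subnet amounts to a lookup like the one below. The subnet-to-row table is entirely made up and uses documentation address ranges; the real allocations are not given in the log.

```python
import ipaddress

# Hypothetical mapping of row name to subnet; not the real eqiad layout.
ROW_SUBNETS = {
    "A": ipaddress.ip_network("192.0.2.0/24"),
    "B": ipaddress.ip_network("198.51.100.0/24"),
}


def row_for(ip):
    """Return the row whose subnet contains ip, or None if no subnet matches."""
    addr = ipaddress.ip_address(ip)
    for row, network in ROW_SUBNETS.items():
        if addr in network:
            return row
    return None


def group_by_row(ips):
    """Group addresses by row, to check whether an outage is confined to one row."""
    groups = {}
    for ip in ips:
        groups.setdefault(row_for(ip), []).append(ip)
    return groups


if __name__ == "__main__":
    print(group_by_row(["198.51.100.7", "198.51.100.9", "192.0.2.3"]))
```

A lookup like this gives only the row, not the rack; the rack-level answer in the channel (B7) came from racktables.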
[23:44:16] PROBLEM - puppet last run on mw1098 is CRITICAL: CRITICAL: Puppet has 38 failures [23:44:23] yeah [23:44:30] it's just a delayed reaction I think [23:46:46] not quite the same hosts actually, but related, and yes delayed [23:47:17] RECOVERY - puppet last run on mw1097 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [23:47:46] RECOVERY - HTTP 5xx req/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [23:49:26] yuvipanda, because of this DNS issue I can't actually make it use self-hosted puppetmaster [23:49:45] Krenair: new host or? [23:49:50] new host, yes [23:50:01] because puppet fails