[01:10:30] Labs-project-Wikistats: Add olo.wikipedia to wikistats - https://phabricator.wikimedia.org/T147613#2698477 (Dereckson)
[01:14:59] Labs-project-Wikistats: Add olo.wikipedia to wikistats - https://phabricator.wikimedia.org/T147613#2698493 (Dzahn) a:Dzahn
[01:15:06] Labs-project-Wikistats: Add olo.wikipedia to wikistats - https://phabricator.wikimedia.org/T147613#2698477 (Dzahn) p:Triage>Normal
[01:57:23] PROBLEM - Puppet staleness on tools-worker-1005 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0]
[05:23:09] Legoktm: the best kind of resilient! ;-)
[06:56:13] PROBLEM - Puppet run on tools-webgrid-lighttpd-1414 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[07:06:23] Labs, Tool-Labs, Operations, Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2698695 (doctaxon) Resolved>Open Hi, I think, a restart is needed again, there are too much 503 errors on several proxy servers like cp1053. A reasonable bot...
[07:09:53] Labs, Tool-Labs, Operations, Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2698701 (doctaxon) If those errors occur again and again, a technically check of these proxies has to be done, I suppose.
[07:36:10] RECOVERY - Puppet run on tools-webgrid-lighttpd-1414 is OK: OK: Less than 1.00% above the threshold [0.0]
[07:52:52] Labs, Tool-Labs, Operations, Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2698809 (doctaxon) Firing with traffic (different API URLs) the error report occurs about every 1.5 minutes (!) (Sorry, but what is an unbreak now! error report, if...
[09:00:49] (PS1) Alexandros Kosiaris: cxserver: Add youdao_api_key dummy stanza [labs/private] - https://gerrit.wikimedia.org/r/314660
[09:06:57] Labs, Tool-Labs, Operations, Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2698915 (Joe) All the restarts finished right now, the cluster should be in a much better shape now.
[09:10:22] (CR) Alexandros Kosiaris: [C: 2 V: 2] cxserver: Add youdao_api_key dummy stanza [labs/private] - https://gerrit.wikimedia.org/r/314660 (owner: Alexandros Kosiaris)
[09:29:31] Labs: Create Labs project automation-framework - https://phabricator.wikimedia.org/T147629#2698953 (Volans)
[09:45:48] Labs, Tool-Labs, Operations, Traffic: repeated 503 errors for 90 minutes now on cp1065 - https://phabricator.wikimedia.org/T146451#2698986 (Joe) Open>Resolved
[12:33:48] (PS1) Ricordisamoa: Add proper User-Agent header to every request [labs/tools/translatemplate] - https://gerrit.wikimedia.org/r/314684
[13:06:21] (PS2) Ricordisamoa: Add proper User-Agent header to every request [labs/tools/translatemplate] - https://gerrit.wikimedia.org/r/314684
[13:11:07] (CR) Ricordisamoa: "PS2 adds User-Agent only if os.getenv('USERNAME') == 'tools.translatemplate'" [labs/tools/translatemplate] - https://gerrit.wikimedia.org/r/314684 (owner: Ricordisamoa)
[13:14:57] Labs, Tool-Labs, Patch-For-Review: support python3 uwsgi apps - https://phabricator.wikimedia.org/T104374#2699272 (Ricordisamoa) >>! In T104374#2494348, @yuvipanda wrote: > Once we migrate all the uwsgi-plain webservice to kubernetes, I think we can call this done *and* remove the uwsgi-plain webserv...
[14:18:40] Labs, Tool-Labs, Operations, Traffic: repeated 503 errors for 90 minutes now - https://phabricator.wikimedia.org/T146451#2699434 (BBlack)
[14:22:03] Labs, Tool-Labs, Operations, Traffic: repeated 503 errors for 90 minutes now - https://phabricator.wikimedia.org/T146451#2661551 (BBlack) (took the cache host out of the title to prevent confusion in future Phab searches for problems on specific cache hosts, since it didn't turn out to be relevant).
[15:14:50] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[15:54:52] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[15:57:26] any labs admin around?
[15:57:49] volans: yes but distracted :)
[15:58:35] chasemp: :) just for T147629, whenever you've time, pinging just in case I'm doing it wrong
[15:58:35] T147629: Create Labs project automation-framework - https://phabricator.wikimedia.org/T147629
[15:59:41] volans: we try to do a +1 not from the creating member I'll ping andrewbogott but it will get talked about and handled tuesday during normal business either way
[15:59:48] if you're not in a rush that is
[16:01:09] Labs: Create Labs project automation-framework - https://phabricator.wikimedia.org/T147629#2699662 (chasemp) +1 to this, we will handle during meeting next tuesday, unless @andrew does this seem ok to you?
[16:01:18] if you accept votes from non-labs too moritzm suggested to use a labs project ;)
[16:01:41] volans, this is for clustershell testing?
[16:02:23] Krenair: also but not only, for now more for the auth/autz model we are going to use
[16:02:50] do you have any existing code for it?
[16:03:29] a lot of stubbed stuff for trying different approaches, nothing anywhere sharable
[16:03:35] volans: we ...don't :D but it's primarily about "is there another project that covers this? is there quota to allocate?"
facilities type questions, and even when we know it's ok we are trying to stick to the process as we ask community members to accept it
[16:04:41] chasemp: make sense :) I was asking just in case
[16:05:32] +1 from me
[16:06:25] thx :)
[17:24:51] Labs, Operations: Move maps share to labstore1003 - https://phabricator.wikimedia.org/T147657#2699887 (madhuvishy)
[17:25:07] hey o. does anyone know how to reboot wmflabs instance that is hanging when sshing in? we're trying to fix T146044 and think android-builds.wmflabs.org just needs a hard reboot
[17:25:07] T146044: [BUG] Alpha builds are not being published - https://phabricator.wikimedia.org/T146044
[17:28:22] niedzielski: on the wikitech instance list there's a reboot button
[17:28:28] there should also be a button in horizon
[17:30:18] valhallasw`cloud: thanks! would that be here or am i looking in the wrong spot? https://wikitech.wikimedia.org/wiki/Nova_Resource:Android-builder.mobile.eqiad.wmflabs
[17:32:20] https://wikitech.wikimedia.org/wiki/Special:NovaInstance
[17:32:37] (the place you're looking makes sense, the interface doesn't ;-))
[17:34:20] valhallasw`cloud: hm, is this a permission based button? i'm logged in but i am not seeing any button there or when clicking through to https://wikitech.wikimedia.org/wiki/Nova_Resource:Mobile
[17:37:15] niedzielski: in the right column for each instance
[17:37:36] if no instances are shown, wikitech's cache is borked, and you should try logging out and in again
[17:37:51] if you're not an admin for a project, I think instances are also not shown, not sure.
[17:38:06] (I can't see mobile's instances)
[17:39:32] valhallasw`cloud: ok, i just did that and now at least i see eqiad underneath mobile but it's not clickable
[17:40:04] valhallasw`cloud: maybe because i'm not an admin
[17:40:22] valhallasw`cloud: i'm just a mobile "member"
[17:40:37] right
[17:41:00] valhallasw`cloud: let me ping one of the admins and see if they get the button.
thank you for your help!!
[17:45:04] Labs: Create Labs project automation-framework - https://phabricator.wikimedia.org/T147629#2699972 (Andrew) This is fine with me -- I'll create the project now.
[17:46:32] Labs, Tracking: New Labs project requests (tracking) - https://phabricator.wikimedia.org/T76375#2699976 (Andrew)
[17:46:34] Labs: Create Labs project automation-framework - https://phabricator.wikimedia.org/T147629#2699973 (Andrew) Open>Resolved a:Andrew This is done. Volans, you are a 'projectadmin' and can add other members or admins as you see fit. Please use the Horizon interface for any puppet work, to save me...
[17:46:54] valhallasw`cloud: o/ thanks! i think bearND over in mobile was able to reboot it
[17:47:22] andrewbogott: thanks a lot!
[17:47:34] \o/
[17:48:11] volans: no problem — you'll need to open another ticket if you need more VMs or floating IPs or whatnot.
[17:48:30] what is the default allowance?
[17:48:50] I don't remember :)
[17:48:51] * andrewbogott checks
[17:48:58] I don't need much
[17:49:46] looks like… 8 cores, 16Gb RAM
[17:50:57] more than enough, I'll probably just need to create 2~3 small VMs, at least for now :)
[17:51:03] *at most
[17:55:11] Labs, Beta-Cluster-Infrastructure, Patch-For-Review: Replace all class imports on Labs with role imports - https://phabricator.wikimedia.org/T147233#2699984 (Andrew)
[18:00:08] and yeah I'll use horizon ;)
[18:01:17] * volans needs to read a lot of docs
[18:08:50] kart_: any word on language-lcmd ?
[19:15:50] PROBLEM - Puppet run on tools-services-02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[19:21:40] PROBLEM - Puppet run on tools-bastion-02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[19:23:25] PROBLEM - Puppet run on tools-bastion-03 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[19:23:27] ^ result from gerrit being down
[19:24:44] interesting
[19:26:54] it tries to git pull composer
[19:28:20] right, gotcha
[19:28:21] Labs, Tool-Labs: puppet failure on tools-worker-1005.tools.eqiad.wmflabs - https://phabricator.wikimedia.org/T147672#2700345 (valhallasw)
[19:30:21] !log toolsbeta fixed certs on toolsbeta-vagrant3-scfc.toolsbeta.eqiad.wmflabs
[19:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL, Master
[19:30:30] !log toolsbeta (puppet certs, to be precise)
[19:30:33] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Toolsbeta/SAL, Master
[19:37:20] PROBLEM - Puppet run on tools-docker-builder-01 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]
[19:37:36] PROBLEM - Puppet run on tools-bastion-05 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[19:44:15] PROBLEM - Puppet run on tools-precise-dev is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[20:06:51] Labs, Operations, Patch-For-Review: Set up monitoring for secondary labstore HA cluster - https://phabricator.wikimedia.org/T144633#2700396 (chasemp) OK things we don't monitor yet: * DRBD service state (and add it to the role to start post all resources) * A check which validates that nfs-kernel-se...
[20:25:51] RECOVERY - Puppet run on tools-services-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:26:14] Is there a guide how to deploy a django page somewhere?
[20:26:43] RECOVERY - Puppet run on tools-bastion-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:33:28] RECOVERY - Puppet run on tools-bastion-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:40:00] Labs, Operations, Patch-For-Review, Tracking: overhaul labstore setup [tracking] - https://phabricator.wikimedia.org/T126083#2700497 (chasemp)
[20:40:02] <7YUAAAAUM> Labs: Change the way manage-nfs-volumes is monitored - https://phabricator.wikimedia.org/T91806#2700496 (chasemp) Open>Invalid
[20:42:35] RECOVERY - Puppet run on tools-bastion-05 is OK: OK: Less than 1.00% above the threshold [0.0]
[20:42:36] Ok who called wikibugs <7YUAAAAUM>
[20:44:15] RECOVERY - Puppet run on tools-precise-dev is OK: OK: Less than 1.00% above the threshold [0.0]
[21:09:40] <7YUAAAAUM> Labs, Operations: revise/fix labstore replicate backup jobs - https://phabricator.wikimedia.org/T127567#2700580 (chasemp) a:madhuvishy A few notes on where this is at for madhu to take over. We have been testing backup schemes and have settled for now on something like what is described in https://p...
[21:40:18] !log wikibugs manually shut down and restarted wikibugs, it had split into two separate bots for some reason
[21:40:18] Did you mean tools.wikibugs instead of wikibugs?
[21:40:19] wikibugs is not a valid project.
[21:40:24] oops
[21:40:29] !log tools.wikibugs manually shut down and restarted wikibugs, it had split into two separate bots for some reason
[21:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.wikibugs/SAL, Master
[21:42:55] Krenair: crazy, did restarting it settle the issues?
[21:43:04] yes
[21:43:23] luckily the way that tool works ensures it doesn't repeat output messages
[21:43:35] one of the bots got a silly name assigned though
[21:44:55] interesting
[21:46:10] it has two separate jobs running - one that reads events from phab and stores them, and (ideally only) one that reads them from the queue and outputs them
[21:47:29] I've seen that it was split, but didn't know that was why
[21:47:50] me neither
[21:47:57] qstat showed only one wb2-irc entry
[21:48:04] but qdel killed both bots
[21:48:06] so I blame the grid
[21:57:17] PROBLEM - Puppet run on tools-webgrid-lighttpd-1406 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[22:07:53] PROBLEM - Puppet run on tools-elastic-03 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[22:37:20] RECOVERY - Puppet run on tools-webgrid-lighttpd-1406 is OK: OK: Less than 1.00% above the threshold [0.0]
[22:39:33] !log tools.stewardbots restarted stewardbot, it was offline probably because of freenode
[22:39:36] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools.stewardbots/SAL, Master
[22:42:52] RECOVERY - Puppet run on tools-elastic-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:00:26] PROBLEM - Puppet run on bdsync-deb is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0]