[00:01:15] 06Labs, 10Labs-Infrastructure, 10Shinken, 13Patch-For-Review: shinken checks broken with "UNKNOWN: execution of the check script exited with exception ..." - https://phabricator.wikimedia.org/T154533#2915266 (10Krenair) [00:03:20] 06Labs, 10Tool-Labs, 13Patch-For-Review: Reduce Precise OGE exec hosts to 5 - https://phabricator.wikimedia.org/T154539#2915270 (10bd808) Sanity check to prevent a repeat of T149634#2758566: ``` tools-bastion-02.tools:~ bd808$ sudo qconf -sel|grep -- -12 tools-exec-1217.eqiad.wmflabs tools-exec-1218.eqiad.wm... [00:05:48] ok, puppet has done it's thing so expect shinken to get upset again [00:09:40] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [00:09:44] 06Labs, 10Tool-Labs, 13Patch-For-Review, 15User-bd808: Reduce Precise OGE exec hosts to 5 - https://phabricator.wikimedia.org/T154539#2915294 (10bd808) a:03bd808 [00:11:24] Krenair: lots of recoveries i see now. just got back. thanks! [00:11:34] mutante, yeah, but [00:11:40] it's not permanent [00:11:44] you have to approve the changes [00:12:08] looks [00:21:56] T105218 [00:21:56] T105218: check_graphite - "UNKNOWN: More than half of the datapoints are undefined " - https://phabricator.wikimedia.org/T105218 [00:22:06] is this like [00:22:23] 154533 [00:22:28] T154533 [00:22:28] T154533: shinken checks broken with "UNKNOWN: execution of the check script exited with exception ..." - https://phabricator.wikimedia.org/T154533 [00:23:14] it's more like a followup [00:23:18] one was "No valid datapoints found" and the other is undefined datapoints [00:23:24] right, yea [00:23:28] thanks! 
[00:23:29] the first patch fixes a bug from T105218's resolving patch [00:23:35] it's all merged on master now [00:23:47] i checked prod icinga for them [00:24:05] i only saw a single UNKNOWN there of the type "UNKNOWN: execution of the check script exited with exception list index out of range" [00:24:33] it's the "Uploads HTTP 5xx reqs/min" though [00:25:33] what was it in prod? [00:25:56] Current Status: UNKNOWN [00:25:57] (for 404d 8h 15m 21s) (Has been acknowledged) [00:25:58] Status Information: UNKNOWN: execution of the check script exited with exception list index out of range [00:26:06] Service [00:26:07] Uploads HTTP 5xx reqs/min [00:26:31] it's also using check_graphite, so i'm watching it [00:28:08] I have a feeling that's a similar error message for a different underlying bug :) [00:28:09] still, acknowledged for 404 days :( [00:28:28] 06Labs, 10Labs-Infrastructure, 10Shinken, 13Patch-For-Review: shinken checks broken with "UNKNOWN: execution of the check script exited with exception ..." - https://phabricator.wikimedia.org/T154533#2915319 (10Dzahn) follow-up for T122332 T105218 (kind of), merged thank you for the fixes! in prod icinga... [00:29:01] yea, the thing is once it's acked and never changes state.. [00:29:10] it's almost invisble [00:29:57] 06Labs, 10Labs-Infrastructure, 10Shinken, 13Patch-For-Review: shinken checks broken with "UNKNOWN: execution of the check script exited with exception ..." - https://phabricator.wikimedia.org/T154533#2915325 (10Krenair) 05Open>03Resolved [00:30:12] unless you really go to check for that. 
everybody goes to the unhandled stuff [00:31:49] i now see "UNKNOWN: No valid datapoints found" for the same thing [00:31:54] yep [00:32:05] in prod icinga [00:32:20] that's expected and I think correct [00:32:31] in shinken too [00:32:33] ok [00:34:44] thanks for the quick response [00:39:22] the "check script exited with exception" note indicates a bug in the check script, in this case check_graphite [00:41:49] Krenair: unrelated question, can you ssh to a random "tools-exec" instance? [00:42:00] that is why i went to shinken originally [00:43:26] oh, wait, probably pebcak [00:43:56] yea, please ignore, works [00:47:24] mutante, were you logging in as the wrong user or something/ [00:47:25] ? [00:47:35] IIRC there are restrictions on who can log into some tools hosts [00:50:17] Krenair: no, didn't have the right key loaded in ssh-add [00:50:32] ah [00:50:33] yeah [00:50:35] changed some aliases since i got the new laptop [00:50:57] are there tools-exec instances on jessie? [00:51:54] I think they use precise and trusty [00:52:35] 12* and 14* i figured out the meaning of that yea [00:52:57] the code is just there for jessie already [00:53:09] puppet code that decides which packages get installed i mean [00:53:29] mutante: we don't have gridengine packages for jessie [00:53:47] so no, no tools-exec jessie hosts [00:54:00] but we have kubernetes jessie hosts [00:54:30] 06Labs, 10Tool-Labs, 07Tracking: Packages to be added to toollabs puppet - https://phabricator.wikimedia.org/T55704#2915370 (10Dzahn) [00:54:33] 06Labs, 10Tool-Labs: Install opencv-data on toollabs - https://phabricator.wikimedia.org/T142321#2915368 (10Dzahn) 05Open>03Resolved That was fault not using the right key. All ok. I confirmed it doesn't exist on precise and is installed on trusty now. And there are no jessie instances yet afaict. ``` roo... 
[00:55:24] bd808: thanks, just wanted to confirm that stuff ^ [00:55:44] krenair@tools-puppetmaster-02:~$ clush -a 'lsb_release -c' | grep tools-exec | grep jessie [00:55:44] krenair@tools-puppetmaster-02:~$ [00:56:09] oh, nice :) [00:56:12] 10Tool-Labs-tools-Pageviews: Query page move log to get pageviews of an article at its older locations - https://phabricator.wikimedia.org/T141332#2915379 (10MusikAnimal) p:05Triage>03Normal [00:56:16] that clush stuff, right [00:56:54] tries to remember that for next time, thanks [00:57:00] yeah [00:57:14] it does the same job as salt, just seems more reliable [00:57:44] *nod* [01:04:19] Krenair: btw, you can do 'clush -g exec' as well [01:04:40] Krenair: see 'nodeset -l' for list of groups [01:07:09] thanks [01:09:42] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0] [01:31:53] 10Tool-Labs-tools-Xtools, 06Community-Tech: Investigation: Plan for rewriting XTools - https://phabricator.wikimedia.org/T154551#2915501 (10kaldari) [01:33:05] 10Tool-Labs-tools-Xtools, 06Community-Tech: Investigation: Plan for rewriting XTools - https://phabricator.wikimedia.org/T154551#2915532 (10MusikAnimal) [02:05:04] 10Tool-Labs-tools-Xtools, 06Community-Tech: Investigation: Plan for rewriting XTools - https://phabricator.wikimedia.org/T154551#2915501 (10bd808) One of the things I would really like to see is the suite broken up in to multiple separate tool accounts. This should make it easier to keep a particular tool in t... [02:43:09] !log tools Reenabled puppet on toolschecker and removed iptables rule on labservices1001 blocking incoming connections from tools-checker-01. 
T152369 [02:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/SAL [02:43:14] T152369: toolschecker fell to pieces when labs-ns0 went down - https://phabricator.wikimedia.org/T152369 [05:17:02] PROBLEM - Puppet run on tools-grid-shadow is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [05:17:12] PROBLEM - Puppet run on tools-cron-01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [05:17:58] PROBLEM - Puppet run on tools-worker-1020 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [05:19:20] PROBLEM - Puppet run on tools-webgrid-lighttpd-1416 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [05:20:16] PROBLEM - Puppet run on tools-exec-1420 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [05:20:55] PROBLEM - Puppet run on tools-exec-1220 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [05:57:00] RECOVERY - Puppet run on tools-grid-shadow is OK: OK: Less than 1.00% above the threshold [0.0] [05:57:12] RECOVERY - Puppet run on tools-cron-01 is OK: OK: Less than 1.00% above the threshold [0.0] [05:57:58] RECOVERY - Puppet run on tools-worker-1020 is OK: OK: Less than 1.00% above the threshold [0.0] [05:59:21] RECOVERY - Puppet run on tools-webgrid-lighttpd-1416 is OK: OK: Less than 1.00% above the threshold [0.0] [06:00:15] RECOVERY - Puppet run on tools-exec-1420 is OK: OK: Less than 1.00% above the threshold [0.0] [06:00:53] RECOVERY - Puppet run on tools-exec-1220 is OK: OK: Less than 1.00% above the threshold [0.0] [06:38:41] PROBLEM - Puppet run on tools-exec-1416 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [06:52:37] PROBLEM - Puppet run on tools-k8s-etcd-03 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [07:13:40] RECOVERY - Puppet run on tools-exec-1416 is OK: OK: Less than 1.00% above the threshold [0.0] [07:32:36] RECOVERY - 
Puppet run on tools-k8s-etcd-03 is OK: OK: Less than 1.00% above the threshold [0.0] [07:39:46] 10Tool-Labs-tools-Pageviews, 07I18n: massviews-category-description lego for "category" - https://phabricator.wikimedia.org/T146973#2915946 (10Nikerabbit) Sorry for babbling too much :). For this particular case I recommend two separate messages, good message documentation, and no message re-use. [08:57:58] Change on 12wikitech.wikimedia.org a page Nova Resource:Tools/Access Request/Chuenlye was created, changed by Chuenlye link https://wikitech.wikimedia.org/wiki/Nova_Resource:Tools/Access_Request/Chuenlye edit summary: Created page with "{{Tools Access Request |Justification=I want to be a volunteer as an infrastructure engineer. So, at first I want to learn more about wikimedia labs/tool labs. |Completed=fals..." [10:35:41] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [10:50:27] 06Labs, 10Tool-Labs, 10DBA: Spatial database for tool-labs - https://phabricator.wikimedia.org/T154497#2916154 (10akosiaris) >>! In T154497#2913411, @scfc wrote: > There is a PostgreSQL database that replicates data from OSM and is used by some tools; currently accounts for that database are managed manually... [11:27:31] Do we have an easy way to list all tools.wmflabs.org repository URLs? I want to add them to https://www.openhub.net/p/mediawiki-webtools [11:40:06] !log ores reboot ores-lb-02 [11:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL [11:41:45] !log ores stopping precaching in ores in labs to reduce load [11:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL [11:50:16] Nemo_bis: repo urls? some are on phab, some are on github, some are on bitbucket... [11:55:41] Nemo_bis: can I add mine? 
[12:04:02] * zhuyifei1999_ is adding a few [12:17:29] Nemo_bis: added these https://www.openhub.net/p/mediawiki-webtools/enlistments?query=toollabs&sort=by_url [12:51:30] !log ores ran 'sudo service uwsgi-ores restart' on ores-web-03 [12:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Ores/SAL [14:01:22] 06Labs, 06Operations, 07kubernetes: docker-engine pulled into our repositories only keeps the latest version - https://phabricator.wikimedia.org/T153416#2916419 (10akosiaris) Unfortunately, as discussed multiple times in the past, reprepro does not really allow us to easily have more than one version of a pa... [15:15:09] Hi! [15:31:58] PROBLEM - Puppet run on tools-worker-1003 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [0.0] [16:37:31] 06Labs, 07Tracking: New Labs project requests (tracking) - https://phabricator.wikimedia.org/T76375#2916757 (10Andrew) [16:37:34] 06Labs: Request creation of twl-staging labs project - https://phabricator.wikimedia.org/T153549#2916756 (10Andrew) 05Open>03declined [16:49:10] 06Labs: Request increased quota for services-test labs project - https://phabricator.wikimedia.org/T153711#2916774 (10Eevans) Anyway I could get an ETA on this, so that I know how to plan? [17:26:24] 06Labs, 10Tool-Labs: Linkwatcher spawns many processes without parent - https://phabricator.wikimedia.org/T123121#2916884 (10Beetstra) @valhallasw, do you mind to clear the instance linkwatcher is on, there are three heavy python scripts there as well, and LiWa3 is building up a massive backlog. Thanks. [17:39:26] 06Labs, 10Tool-Labs: Linkwatcher spawns many processes without parent - https://phabricator.wikimedia.org/T123121#2916920 (10valhallasw) No, because those are tasks that cannot be restarted. I rescheduled linkwatcher yesterday, but due to the massive memory requirement (26GB, while we only have 8GB mem + 25GB... 
[17:53:23] 06Labs, 10Labs-Infrastructure, 10DBA, 06Operations, 13Patch-For-Review: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#1936600 (10faidon) Early January is here and the 15th is coming up fast -- @yuvipanda rightfully mentioned above that this will need a (presumably a... [17:53:59] 06Labs, 15User-Nikerabbit: Revert in 01/2017: Request creation of wmwcourse labs project - https://phabricator.wikimedia.org/T144388#2916973 (10Nikerabbit) I started shutting down instances that are no longer in use. Please don't delete the ones still running just yet. [17:55:47] 06Labs, 10Labs-Infrastructure, 10DBA, 06Operations, 13Patch-For-Review: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#2916980 (10yuvipanda) @jcrespo How does Jan 25 / 26 work for you? [18:03:48] 06Labs, 10Labs-Infrastructure, 10DBA, 06Operations, 13Patch-For-Review: Migrate labsdb1005/1006/1007 to jessie - https://phabricator.wikimedia.org/T123731#2917046 (10jcrespo) >>! In T123731#2916980, @yuvipanda wrote: > @jcrespo How does Jan 25 / 26 work for you? +1. let's meet at some point to organize... [18:55:30] I do not really know whom to ask and what to ask. I have here a bot "tools.persondata" which stopped working an Nov. 4th. The only thing I see is "Login failed. Check your username and password." on stderr. The bot is written in C#/Mono and I do not have the sources. [18:55:44] Any idea which password is missing? [18:56:17] A file with secrets which got deleted? [19:00:41] 06Labs, 10Tool-Labs: Linkwatcher spawns many processes without parent - https://phabricator.wikimedia.org/T123121#2917266 (10Beetstra) hmm. Any idea how long those 3 python scripts will stay? linkwatcher will munch away its backlog in time. Until the wikimedia linklog system comes online I don't foresee a w... 
[19:03:56] Wurgl: you can `strace` it [19:04:59] to see which files are read [19:06:41] Wurgl: what zhuyifei1999_ said is probably the only choice, it's a shame the source isn't obvious hm [19:07:48] I did already strace it … [19:09:37] But I did not see any helpful information [19:09:47] open("/mnt/nfs/labstore-secondary-tools-project/persondata/Cache/CommonData.xml", O_RDONLY) = 5 [19:09:48] open("/mnt/nfs/labstore-secondary-tools-project/persondata/Cache/https.de.wikipedia.org.xml", O_RDONLY) = 5 [19:10:45] Wurgl: so you'll probably have to start it locally w/ strace as in it most likely reads the credentials file early [19:10:52] maybe that's what you did tho :) [19:10:55] I lurked into the corresponding files of other bots … but I did not miss anything, btw: modification date is ~ feb 2015 [19:11:53] locally does not work, some resource is missing and a timeout kills me (throws an exception) [19:12:21] chasemp: Eric is blocked on the quota change described in https://phabricator.wikimedia.org/T153711; could you tackle that soon? [19:14:16] gwicke: yep the review meeting was today due to holidays so most likely today or tomorrow we'll knock it out (seems reasonable and temp but takes two of us to grok) [19:14:17] no worries [19:14:19] Sadly I do not have the sources … and the author is … hmm … away [19:15:57] 06Labs: Request increased quota for services-test labs project - https://phabricator.wikimedia.org/T153711#2888152 (10chasemp) Hey @Eevans thanks for being patient with the holidays we are just now back in an almost normal session. Seems cool, thanks for noting this is temp (any rough timeline?). 
+1 [19:16:56] I also looked with "strings" into the executable, but did not see any string which smells like a password [19:18:23] Wurgl: honestly reverse engineering this with no docs is a potentially deep affair and I don't have the ability at the moment, one thing is to make a task we can @ping the author on to get sources into diffusion or github so we can fix this but I know that's not an immediate answer [19:18:49] There are three databases used by that whole bunch of programs (~100 php scripts, two C#/mono executables), but all three databasdes are accessible with mysql on the commandline. and replica.my.cnf was not changed since 2014 [19:19:11] 06Labs: Request increased quota for services-test labs project - https://phabricator.wikimedia.org/T153711#2888152 (10Andrew) Approved, will do shortly [19:21:06] I was lucky to get in contact with the author to reach the status of a co-maintainer, so I have access and "could" modify. A mail is already waiting in his inbox … so it may just take some days/weeks … :-( [19:21:30] Wurgl: I would persist this all to a task in phabricator so we can keep the narrative running [19:21:41] I'll try to take a look when I can as well, maybe one of us will have a bright idea [19:21:49] but one thing I know for sure, it won't get looked at w/o a task [19:21:55] sorry it's a bummer [19:28:35] No need for you to solve it, at least for now. [19:28:59] The problem exists since 4th of November, so … [19:29:02] 06Labs: Request increased quota for services-test labs project - https://phabricator.wikimedia.org/T153711#2917384 (10Andrew) 05Open>03stalled Done -- please notify us when you're done with the additional instance. 
[19:29:16] 06Labs: Revert increased quota for services-test labs project - https://phabricator.wikimedia.org/T153711#2917390 (10Andrew) [19:30:35] Wurgl: :) I never miss a chance to bang the 'publish your source code publicly' drum ;) [19:32:10] me too [19:32:16] 06Labs, 06Operations, 13Patch-For-Review: cronspam from labstores, labcontrol, labstestservices - https://phabricator.wikimedia.org/T149574#2917402 (10faidon) 05Open>03Resolved a:03faidon I don't think there's anything else to be done for this and it seems to have been largely ignored anyway. Resolving. [19:33:46] chasemp: thanks! [19:39:36] 06Labs, 06Operations, 10video2commons: Consider mounting labs NFS labstore1003.eqiad.wmnet:/scratch for server-side uploads - https://phabricator.wikimedia.org/T153068#2917434 (10yuvipanda) [19:40:41] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0] [20:05:17] PROBLEM - Puppet staleness on tools-worker-1003 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [43200.0] [20:06:43] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [20:09:34] PROBLEM - Puppet run on tools-webgrid-lighttpd-1403 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [20:09:40] PROBLEM - Puppet run on tools-exec-1406 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [20:10:05] 10PAWS: Kernel consistently crashing (out of memory issue?) 
- https://phabricator.wikimedia.org/T154608#2917572 (10Nettrom) [20:10:16] PROBLEM - Puppet run on tools-redis-1002 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [20:10:19] 10PAWS: Kernel consistently crashing - https://phabricator.wikimedia.org/T154608#2917572 (10Nettrom) [20:10:26] PROBLEM - Puppet run on tools-checker-02 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [20:10:32] PROBLEM - Puppet run on tools-static-11 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [20:10:56] PROBLEM - Puppet run on tools-exec-1214 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [20:11:50] PROBLEM - Puppet run on tools-worker-1013 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [20:12:13] PROBLEM - Puppet run on tools-webgrid-lighttpd-1412 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [20:12:19] PROBLEM - Puppet run on tools-prometheus-01 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [20:13:11] PROBLEM - Puppet run on tools-webgrid-lighttpd-1209 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [20:13:13] PROBLEM - Puppet run on tools-webgrid-lighttpd-1210 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [20:13:19] PROBLEM - Puppet run on tools-worker-1001 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [20:13:19] PROBLEM - Puppet run on tools-exec-1410 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [20:13:27] PROBLEM - Puppet run on tools-exec-1409 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [20:13:35] PROBLEM - Puppet run on tools-k8s-etcd-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [20:13:39] PROBLEM - Puppet run on tools-mail-01 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [0.0] [20:13:41] PROBLEM - Puppet run on 
tools-worker-1016 is CRITICAL: CRITICAL: 12.50% of data above the critical threshold [0.0] [20:13:51] PROBLEM - Puppet run on tools-webgrid-lighttpd-1409 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [20:13:53] PROBLEM - Puppet run on tools-worker-1002 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [20:13:58] PROBLEM - Puppet run on tools-k8s-master-01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [20:14:00] PROBLEM - Puppet run on tools-webgrid-lighttpd-1408 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0] [20:14:14] PROBLEM - Puppet run on tools-exec-1212 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0] [20:14:20] PROBLEM - Puppet run on tools-mail is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [20:14:42] PROBLEM - Puppet run on tools-exec-1216 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [20:14:54] PROBLEM - Puppet run on tools-exec-gift is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0] [20:15:14] PROBLEM - Puppet run on tools-webgrid-lighttpd-1205 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [20:15:21] 10PAWS: Kernel consistently crashing - https://phabricator.wikimedia.org/T154608#2917572 (10yuvipanda) I'm pretty sure it's just running out of memory. How big is the data being imported? 
[20:15:34] PROBLEM - Puppet run on tools-flannel-etcd-02 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [20:15:42] PROBLEM - Puppet run on tools-bastion-02 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [20:16:10] PROBLEM - Puppet run on tools-webgrid-lighttpd-1410 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0] [20:16:38] see -operations channel for some details, expecting recoveries [20:16:48] PROBLEM - Puppet run on tools-exec-1217 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [20:16:58] wikitech had a problem and recovered [20:16:59] PROBLEM - Puppet run on tools-webgrid-generic-1402 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0] [20:17:17] PROBLEM - Puppet run on tools-worker-1009 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0] [20:17:21] PROBLEM - Puppet run on tools-exec-1412 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [20:17:23] thanks mutante for noticing [20:17:56] 10PAWS: Kernel consistently crashing - https://phabricator.wikimedia.org/T154608#2917640 (10Nettrom) 3,746,600 rows. The file I'm importing is 259MiB when unzipped. 
[20:18:19] PROBLEM - Puppet run on tools-elastic-01 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [20:18:23] PROBLEM - Puppet run on tools-bastion-03 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [20:18:30] PROBLEM - Puppet run on tools-worker-1017 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0] [20:18:30] PROBLEM - Puppet run on tools-worker-1018 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0] [20:19:25] PROBLEM - Puppet run on tools-worker-1011 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0] [20:19:35] PROBLEM - Puppet run on tools-k8s-etcd-02 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0] [20:33:23] 10Tool-Labs-tools-Xtools, 06Community-Tech: Investigation: Plan for rewriting XTools - https://phabricator.wikimedia.org/T154551#2917678 (10Matthewrbowker) Hello, @kaldari! Thank you for creating this task. The current components of xTools are as follows (I've already checked off the ones I've converted th... [20:43:46] 10Tool-Labs-tools-Xtools, 06Community-Tech: Investigation: Plan for rewriting XTools - https://phabricator.wikimedia.org/T154551#2917696 (10bd808) >>! In T154551#2917678, @Matthewrbowker wrote: > @bd808: @MusikAnimal and I have been discussing moving xTools to its own Labs instance, or multiple labs instances... 
[20:45:28] RECOVERY - Puppet run on tools-static-11 is OK: OK: Less than 1.00% above the threshold [0.0] [20:46:51] RECOVERY - Puppet run on tools-worker-1013 is OK: OK: Less than 1.00% above the threshold [0.0] [20:47:19] RECOVERY - Puppet run on tools-prometheus-01 is OK: OK: Less than 1.00% above the threshold [0.0] [20:48:17] RECOVERY - Puppet run on tools-worker-1001 is OK: OK: Less than 1.00% above the threshold [0.0] [20:48:39] RECOVERY - Puppet run on tools-mail-01 is OK: OK: Less than 1.00% above the threshold [0.0] [20:48:55] RECOVERY - Puppet run on tools-worker-1002 is OK: OK: Less than 1.00% above the threshold [0.0] [20:49:31] RECOVERY - Puppet run on tools-webgrid-lighttpd-1403 is OK: OK: Less than 1.00% above the threshold [0.0] [20:49:39] RECOVERY - Puppet run on tools-exec-1406 is OK: OK: Less than 1.00% above the threshold [0.0] [20:50:17] RECOVERY - Puppet run on tools-redis-1002 is OK: OK: Less than 1.00% above the threshold [0.0] [20:50:27] RECOVERY - Puppet run on tools-checker-02 is OK: OK: Less than 1.00% above the threshold [0.0] [20:50:54] RECOVERY - Puppet run on tools-exec-1214 is OK: OK: Less than 1.00% above the threshold [0.0] [20:51:50] RECOVERY - Puppet run on tools-exec-1217 is OK: OK: Less than 1.00% above the threshold [0.0] [20:52:10] RECOVERY - Puppet run on tools-webgrid-lighttpd-1412 is OK: OK: Less than 1.00% above the threshold [0.0] [20:52:16] RECOVERY - Puppet run on tools-worker-1009 is OK: OK: Less than 1.00% above the threshold [0.0] [20:53:10] RECOVERY - Puppet run on tools-webgrid-lighttpd-1210 is OK: OK: Less than 1.00% above the threshold [0.0] [20:53:12] RECOVERY - Puppet run on tools-webgrid-lighttpd-1209 is OK: OK: Less than 1.00% above the threshold [0.0] [20:53:20] RECOVERY - Puppet run on tools-elastic-01 is OK: OK: Less than 1.00% above the threshold [0.0] [20:53:22] RECOVERY - Puppet run on tools-exec-1410 is OK: OK: Less than 1.00% above the threshold [0.0] [20:53:28] RECOVERY - Puppet run on 
tools-exec-1409 is OK: OK: Less than 1.00% above the threshold [0.0] [20:53:29] RECOVERY - Puppet run on tools-worker-1017 is OK: OK: Less than 1.00% above the threshold [0.0] [20:53:29] RECOVERY - Puppet run on tools-worker-1018 is OK: OK: Less than 1.00% above the threshold [0.0] [20:53:37] RECOVERY - Puppet run on tools-k8s-etcd-01 is OK: OK: Less than 1.00% above the threshold [0.0] [20:53:42] RECOVERY - Puppet run on tools-worker-1016 is OK: OK: Less than 1.00% above the threshold [0.0] [20:53:52] RECOVERY - Puppet run on tools-webgrid-lighttpd-1409 is OK: OK: Less than 1.00% above the threshold [0.0] [20:53:58] RECOVERY - Puppet run on tools-k8s-master-01 is OK: OK: Less than 1.00% above the threshold [0.0] [20:54:00] RECOVERY - Puppet run on tools-webgrid-lighttpd-1408 is OK: OK: Less than 1.00% above the threshold [0.0] [20:54:14] RECOVERY - Puppet run on tools-exec-1212 is OK: OK: Less than 1.00% above the threshold [0.0] [20:54:20] RECOVERY - Puppet run on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0] [20:54:24] RECOVERY - Puppet run on tools-worker-1011 is OK: OK: Less than 1.00% above the threshold [0.0] [20:54:32] RECOVERY - Puppet run on tools-k8s-etcd-02 is OK: OK: Less than 1.00% above the threshold [0.0] [20:54:42] RECOVERY - Puppet run on tools-exec-1216 is OK: OK: Less than 1.00% above the threshold [0.0] [20:54:54] RECOVERY - Puppet run on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [0.0] [20:55:12] RECOVERY - Puppet run on tools-webgrid-lighttpd-1205 is OK: OK: Less than 1.00% above the threshold [0.0] [20:55:34] RECOVERY - Puppet run on tools-flannel-etcd-02 is OK: OK: Less than 1.00% above the threshold [0.0] [20:55:45] RECOVERY - Puppet run on tools-bastion-02 is OK: OK: Less than 1.00% above the threshold [0.0] [20:56:11] RECOVERY - Puppet run on tools-webgrid-lighttpd-1410 is OK: OK: Less than 1.00% above the threshold [0.0] [20:56:45] 10Tool-Labs-tools-Xtools, 06Community-Tech: Investigation: Plan for 
rewriting XTools - https://phabricator.wikimedia.org/T154551#2917712 (10Matthewrbowker) @bd808 I understand. My comment with divorcing the code from Tool Labs more had to do with fun like [[https://github.com/x-tools/xtools/blob/master/public... [20:56:57] RECOVERY - Puppet run on tools-webgrid-generic-1402 is OK: OK: Less than 1.00% above the threshold [0.0] [20:57:19] RECOVERY - Puppet run on tools-exec-1412 is OK: OK: Less than 1.00% above the threshold [0.0] [20:58:21] RECOVERY - Puppet run on tools-bastion-03 is OK: OK: Less than 1.00% above the threshold [0.0] [21:00:27] 10Tool-Labs-tools-Xtools, 06Community-Tech: Investigation: Plan for rewriting XTools - https://phabricator.wikimedia.org/T154551#2917738 (10tom29739) >>! In T154551#2917678, @Matthewrbowker wrote: > The rebirth code is written using the Symfony framework: http://symfony.com/. I opted to use the framework beca... [21:05:00] 10Tool-Labs-tools-Xtools, 06Community-Tech: Investigation: Plan for rewriting XTools - https://phabricator.wikimedia.org/T154551#2917740 (10Matthewrbowker) @tom29739 Thank you for that information. I'll take a look at that when I have a few moments, and see if there's a workaround. It may require adding an i... [21:05:35] !log deployment-prep deployment-cache-text04 stopping nginx service, running puppet to debug dependency issue [21:05:41] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL [21:06:05] this gave us a new error to debug from [21:06:06] Error 400 on SERVER: Reading data from Hosts/deployment-cache-text04 failed: NoMethodError: undefined method `[]' for nil:NilClass at /etc/puppet/manifests/realm.pp:40 [21:06:54] ok, so nginx can't restart [21:07:06] and that broke the dependency cycle on the acme stuff, afaict [21:07:16] when i manually stopped it, we get the new error above [21:07:38] at /etc/puppet/manifests/realm.pp:40 sounds ... well.. 
i dunno [21:07:44] like something more general is broken [21:09:37] so because it tries to do the Hiera lookup from Wikitech and Wikitech is broken [21:09:41] due to the mw deployment earlier [21:10:03] but that was reverted? i dont know [21:15:31] 39 if $realm == undef { [21:15:36] 40 $realm = hiera('realm', 'production') [21:16:00] chasemp: Krenair ^ so, it tries to do that lookup, what is the realm, and the lookup itself fails [21:16:42] RECOVERY - Puppet run on tools-services-01 is OK: OK: Less than 1.00% above the threshold [0.0] [21:16:42] yes [21:16:43] so, yes, Hiera, but thcipriani already reverted [21:16:57] does it cache those errors perhaps? [21:17:10] Warning: Not using cache on failed catalog [21:17:23] hrm so things like RUBYLIB=/var/lib/puppet/lib hiera realm ::instanceproject=staging ::hostname=$(hostname -s) --debug -c /etc/puppet/hiera.yaml work in at least 1 project [21:18:15] IIRC there is some kind of caching for the mwyaml backend...not that I can remember how it works [21:19:09] I want to restart the puppetmaster [21:19:13] think I should thcipriani? [21:19:24] modules/wmflib/lib/hiera/mwcache.rb makes it look like the cache is just in memory [21:20:05] so we are getting a 404 from the puppetmaster, and i think that sounds like an idea [21:20:10] restarting the master [21:20:41] eh, that looks like i am already doing it, didn't mean that [21:20:46] where is it though [21:21:14] the puppetmaster? [21:21:21] server = deployment-puppetmaster02.deployment-prep.eqiad.wmflabs [21:21:21] deployment-puppetmaster02 [21:21:42] do you want me to do it? 
[21:21:49] i'll do it
[21:21:53] thanks
[21:22:16] yeah, cache looks to be just in memory (/me late)
[21:23:00] pfff
[21:23:16] it's still stuck on the LE issue
[21:23:20] I'm not sure why
[21:23:30] the dependencies shown on the error don't seem to match up with the code
[21:23:31] it was because it could not start nginx
[21:23:46] nginx is already running and it seems to start successfully
[21:23:48] i see, back to the error before
[21:23:54] did puppet start it ?
[21:23:58] or a human
[21:24:05] I just stopped it and started it manually just to be sure
[21:24:21] i meant let puppet start the service, but i already tried that earlier.. hmm
[21:24:32] and then i got the Hiera error, heh
[21:25:04] tries that one more time
[21:25:27] yea, same problem
[21:26:08] !log deployment-prep trying to troubleshoot puppet by stopping nginx then letting puppet start it
[21:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL
[21:26:14] pff
[21:26:28] it didn't do it
[21:26:29] it did error
[21:26:43] yep
[21:27:20] we should try to run the acme-setup command manually now
[21:27:37] it's gonne be .. ehm /usr/local/sbin/acme-setup -i .. -s ...
[21:27:40] this dependency still doesn't seem correct
[21:28:18] that'll be fun to generate the full command for
[21:29:27] hang on
[21:29:38] !log deployment-cache-text-04 - running acme-setup command to debug
[21:29:39] Unknown project "deployment-cache-text-04"
[21:29:48] modules/tlsproxy/manifests/localssl.pp has the letsencrypt::cert::integrated before => Service['nginx']
[21:29:57] !log deployment-prep deployment-cache-text-04 - running acme-setup command to debug .. Creating CSR /etc/acme/csr/beta_wmflabs_org.pem
[21:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Nova_Resource:Deployment-prep/SAL
[21:30:41] nothing changed ...
[21:30:42] musikanimal, what was the full command you ran?
[21:30:51] root@deployment-cache-text04:~# /usr/local/sbin/acme-setup -i beta_wmflabs_org -s beta.wmflabs.org
[21:31:12] mutante, * sorry musikanimal
[21:31:14] taken from:
[21:31:14] 13 exec { "acme-setup-self-${safe_title}":
[21:31:15] 114 command => $base_cmd,
[21:31:15] 115 creates => "/etc/acme/cert/${safe_title}.crt",
[21:31:25] okay
[21:31:32] 110 $base_cmd = "/usr/local/sbin/acme-setup -i ${safe_title} -s ${subjects}"
[21:31:32] I just commented a dependency and it's happy again
[21:31:37] oh?
[21:31:56] where did you comment?
[21:32:07] essentially reverting https://gerrit.wikimedia.org/r/#/c/327465/
[21:32:27] oh! but 3 weeks ago
[21:32:30] yeah
[21:32:35] so there was the theory that this was lingering
[21:32:37] let's see
[21:32:37] but how
[21:33:09] that change totally looks related
[21:33:21] but did puppet not run for a while until today?
[21:35:05] someone was messing with the update process on december 16th
[21:35:31] about a day after that commit
[21:36:08] someone did a rebase today
[21:37:04] where is hashar
[21:55:24] Labs, Labs-Infrastructure: Deprecate precise instances in Labs - https://phabricator.wikimedia.org/T143349#2917914 (Mattflaschen-WMF)
[21:58:11] Labs, wikitech.wikimedia.org: Wikitech blank page and no logs with mediawiki 1.29.0-wmf.7 - https://phabricator.wikimedia.org/T154618#2917919 (thcipriani)
[21:58:50] Labs, wikitech.wikimedia.org: Wikitech blank page and no logs with mediawiki 1.29.0-wmf.7 - https://phabricator.wikimedia.org/T154618#2917933 (thcipriani)
[22:04:28] Labs, wikitech.wikimedia.org: Wikitech blank page and no logs with mediawiki 1.29.0-wmf.7 - https://phabricator.wikimedia.org/T154618#2917957 (thcipriani) p:Triage>High
[22:11:57] Labs, wikitech.wikimedia.org: Wikitech blank page and no logs with mediawiki 1.29.0-wmf.7 - https://phabricator.wikimedia.org/T154618#2918001 (thcipriani) Well it looks like there are logs on silver itself, and this also explains why no other wikis
were affected ``` [Wed Jan 04 20:06:56.757579 2017] [:e...
[22:14:40] Labs, wikitech.wikimedia.org: Wikitech blank page and no logs with mediawiki 1.29.0-wmf.7 - https://phabricator.wikimedia.org/T154618#2918009 (thcipriani) I think I may have found the issue: https://github.com/wikimedia/mediawiki-extensions-SemanticForms/blob/master/RENAMED.txt is the only thing in php-1...
[22:23:21] Labs: Upgrade Labs to OpenStack Mitaka - https://phabricator.wikimedia.org/T145919#2918026 (Andrew)
[22:23:23] Labs, Labs-Infrastructure, Patch-For-Review: Keystone: eventlet deprecated in M - https://phabricator.wikimedia.org/T150774#2918024 (Andrew) Open>Resolved a:Andrew
[22:40:28] Labs, Tool-Labs, Community-Tech-Tool-Labs, User-bd808: Facilitate Volunteer NDA application process for potential Tool Labs standards committee appointees - https://phabricator.wikimedia.org/T154625#2918138 (bd808)
[22:55:18] Labs, Tool-Labs, Community-Tech-Tool-Labs, User-bd808: Facilitate Volunteer NDA application process for potential Tool Labs standards committee appointees - https://phabricator.wikimedia.org/T154625#2918195 (chasemp)
[23:07:40] PROBLEM - Puppet run on tools-services-01 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [0.0]
[23:14:09] PROBLEM - Puppet run on tools-webgrid-lighttpd-1209 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[23:14:21] PROBLEM - Puppet run on tools-elastic-01 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[23:14:29] PROBLEM - Puppet run on tools-worker-1017 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[23:14:36] PROBLEM - Puppet run on tools-k8s-etcd-01 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[23:14:52] PROBLEM - Puppet run on tools-webgrid-lighttpd-1409 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[23:15:00] PROBLEM - Puppet run on
tools-webgrid-lighttpd-1408 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [0.0]
[23:15:20] PROBLEM - Puppet run on tools-mail is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[23:15:24] PROBLEM - Puppet run on tools-worker-1011 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[23:15:33] PROBLEM - Puppet run on tools-k8s-etcd-02 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [0.0]
[23:15:53] PROBLEM - Puppet run on tools-exec-gift is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[23:16:37] PROBLEM - Puppet run on tools-flannel-etcd-02 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[23:17:09] PROBLEM - Puppet run on tools-webgrid-lighttpd-1410 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[23:17:58] PROBLEM - Puppet run on tools-webgrid-generic-1402 is CRITICAL: CRITICAL: 44.44% of data above the critical threshold [0.0]
[23:18:00] PROBLEM - Puppet run on tools-worker-1012 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[23:18:02] PROBLEM - Puppet run on tools-grid-shadow is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [0.0]
[23:18:13] PROBLEM - Puppet run on tools-cron-01 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [0.0]
[23:18:17] PROBLEM - Puppet run on tools-worker-1009 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[23:18:19] PROBLEM - Puppet run on tools-exec-1412 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[23:19:21] PROBLEM - Puppet run on tools-bastion-03 is CRITICAL: CRITICAL: 66.67% of data above the critical threshold [0.0]
[23:19:29] PROBLEM - Puppet run on tools-worker-1018 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [0.0]
[23:19:59] PROBLEM - Puppet run on tools-exec-1418 is CRITICAL: CRITICAL: 50.00% of data above the critical threshold [0.0]
[23:20:19] PROBLEM - Puppet run on tools-worker-1006 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[23:20:20] PROBLEM - Puppet run on tools-webgrid-lighttpd-1416 is CRITICAL: CRITICAL: 55.56% of data above the critical threshold [0.0]
[23:34:30] Labs, Tool-Labs, Community-Tech-Tool-Labs, User-bd808: Facilitate Volunteer NDA application process for potential Tool Labs standards committee appointees - https://phabricator.wikimedia.org/T154625#2918251 (bd808)
[23:42:24] (PS1) Volans: Keyholder: add dummy keys for Cumin [labs/private] - https://gerrit.wikimedia.org/r/330600 (https://phabricator.wikimedia.org/T154588)
[23:51:35] RECOVERY - Puppet run on tools-flannel-etcd-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:53:16] RECOVERY - Puppet run on tools-worker-1009 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:54:10] RECOVERY - Puppet run on tools-webgrid-lighttpd-1209 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:54:19] RECOVERY - Puppet run on tools-elastic-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:54:23] RECOVERY - Puppet run on tools-bastion-03 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:54:29] RECOVERY - Puppet run on tools-worker-1017 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:54:29] RECOVERY - Puppet run on tools-worker-1018 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:54:37] RECOVERY - Puppet run on tools-k8s-etcd-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:54:52] RECOVERY - Puppet run on tools-webgrid-lighttpd-1409 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:55:01] RECOVERY - Puppet run on tools-webgrid-lighttpd-1408 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:55:19] RECOVERY - Puppet run on tools-worker-1006 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:55:20] RECOVERY - Puppet run on tools-webgrid-lighttpd-1416 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:55:21] RECOVERY - Puppet run on tools-mail is OK: OK: Less than 1.00% above the threshold [0.0]
[23:55:23] RECOVERY - Puppet run on tools-worker-1011 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:55:34] RECOVERY - Puppet run on tools-k8s-etcd-02 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:55:54] RECOVERY - Puppet run on tools-exec-gift is OK: OK: Less than 1.00% above the threshold [0.0]
[23:57:10] RECOVERY - Puppet run on tools-webgrid-lighttpd-1410 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:57:54] RECOVERY - Puppet run on tools-webgrid-generic-1402 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:58:02] RECOVERY - Puppet run on tools-worker-1012 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:58:02] RECOVERY - Puppet run on tools-grid-shadow is OK: OK: Less than 1.00% above the threshold [0.0]
[23:58:10] RECOVERY - Puppet run on tools-cron-01 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:58:22] RECOVERY - Puppet run on tools-exec-1412 is OK: OK: Less than 1.00% above the threshold [0.0]
[23:59:26] (CR) Volans: [V: 2 C: 2] "Dummy keys for keyholder" [labs/private] - https://gerrit.wikimedia.org/r/330600 (https://phabricator.wikimedia.org/T154588) (owner: Volans)
[23:59:57] RECOVERY - Puppet run on tools-exec-1418 is OK: OK: Less than 1.00% above the threshold [0.0]