[00:01:33] RECOVERY - NTP on mw31 is OK: NTP OK: Offset 0.002007722855 secs
[00:02:19] (PS3) Brion VIBBER: Fix popup video size by ordering transcode settings properly [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/115094
[00:02:32] PROBLEM - gitblit.wikimedia.org on antimony is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Server Error - 1703 bytes in 6.498 second response time
[00:02:32] there, now has right bug number in the comment :)
[00:09:32] RECOVERY - gitblit.wikimedia.org on antimony is OK: HTTP OK: HTTP/1.1 200 OK - 188683 bytes in 6.772 second response time
[00:16:05] (CR) Ori.livneh: "Weekly ping." [operations/puppet] - https://gerrit.wikimedia.org/r/112314 (owner: Ori.livneh)
[00:39:51] (Cannot contact the database server: Too many connections (10.64.16.7))
[01:05:42] PROBLEM - Host mw72 is DOWN: PING CRITICAL - Packet loss = 100%
[01:06:15] !log tstarling synchronized php-1.23wmf14/includes/SiteStats.php
[01:06:32] RECOVERY - Host mw72 is UP: PING OK - Packet loss = 0%, RTA = 35.40 ms
[01:08:09] !log tstarling updated /a/common/php-1.23wmf15 to {{Gerrit|I268599be9}}: [1.23wmf15] Make SiteStats (re)initializing more sane
[01:08:49] !log tstarling synchronized php-1.23wmf15/includes/SiteStats.php
[01:13:12] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:18:02] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.168 second response time
[01:38:12] PROBLEM - puppetmaster https on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:44:02] PROBLEM - HTTP on virt0 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:52:39] wikitech is down?
[01:57:29] I'm looking at it. I don't know what's happening yet.
[01:57:37] Thanks.
[01:57:44] legoktm: Yeah, down for me as well.
[01:57:53] thanks andrewbogott
[01:59:52] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Fri 21 Feb 2014 04:42:42 PM UTC
[02:01:02] RECOVERY - HTTP on virt0 is OK: HTTP OK: HTTP/1.1 302 Found - 457 bytes in 5.257 second response time
[02:03:50] well… that didn't teach me anything
[02:05:02] RECOVERY - puppetmaster https on virt0 is OK: HTTP OK: Status line output matched 400 - 336 bytes in 3.276 second response time
[02:20:21] Gloria, legoktm, looking better now?
[02:20:46] andrewbogott: yes! thanks
[02:20:51] ^
[02:31:50] !log LocalisationUpdate completed (1.23wmf14) at 2014-02-24 02:31:50+00:00
[02:32:00] Logged the message, Master
[02:44:44] !log LocalisationUpdate completed (1.23wmf15) at 2014-02-24 02:44:44+00:00
[02:44:53] Logged the message, Master
[02:47:52] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Last successful Puppet run was Sat 22 Feb 2014 02:36:40 PM UTC
[03:27:55] !log LocalisationUpdate ResourceLoader cache refresh completed at 2014-02-24 03:27:55+00:00
[03:28:03] Logged the message, Master
[04:55:49] (PS3) Hoo man: Don't use a hard coded -D in the sql utility script [operations/puppet] - https://gerrit.wikimedia.org/r/113661
[04:58:10] (CR) Springle: [C: 2] Don't use a hard coded -D in the sql utility script [operations/puppet] - https://gerrit.wikimedia.org/r/113661 (owner: Hoo man)
[05:00:52] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Fri 21 Feb 2014 04:42:42 PM UTC
[05:11:16] (PS1) Physikerwelt: Add texlive-lang-greek [operations/puppet] - https://gerrit.wikimedia.org/r/115102
[05:12:09] (PS2) Physikerwelt: Add texlive-lang-greek [operations/puppet] - https://gerrit.wikimedia.org/r/115102
[05:13:48] (PS3) Physikerwelt: Add texlive-lang-greek [operations/puppet] - https://gerrit.wikimedia.org/r/115102
[05:48:52] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Last successful Puppet run was Sat 22 Feb 2014 02:36:40 PM UTC
[06:28:53] PROBLEM - Disk space on virt11 is CRITICAL: DISK CRITICAL - free space: /var/lib/nova/instances 44138 MB (3% inode=99%):
[08:01:52] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Fri 21 Feb 2014 04:42:42 PM UTC
[08:18:09] (PS1) Ori.livneh: txStatsD: add dependency on python-twisted-web [operations/puppet] - https://gerrit.wikimedia.org/r/115121
[08:18:41] (CR) Ori.livneh: [C: 2] txStatsD: add dependency on python-twisted-web [operations/puppet] - https://gerrit.wikimedia.org/r/115121 (owner: Ori.livneh)
[08:19:24] (CR) Ori.livneh: [V: 2] txStatsD: add dependency on python-twisted-web [operations/puppet] - https://gerrit.wikimedia.org/r/115121 (owner: Ori.livneh)
[08:49:52] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Last successful Puppet run was Sat 22 Feb 2014 02:36:40 PM UTC
[10:42:06] (PS5) BryanDavis: Send Vary header on http to http redirect [operations/puppet] - https://gerrit.wikimedia.org/r/111917
[10:42:11] (CR) Faidon Liambotis: [C: 2 V: 2] Send Vary header on http to http redirect [operations/puppet] - https://gerrit.wikimedia.org/r/111917 (owner: BryanDavis)
[10:42:24] eep
[10:42:30] mid-flight collision?
[10:42:48] ah, no, the bot being stupid
[10:42:58] (PS1) Hashar: Describe Math related packages in a class [operations/puppet] - https://gerrit.wikimedia.org/r/115133
[10:45:55] (PS1) Hashar: Move math related packages to a puppet class [operations/debs/wikimedia-task-appserver] - https://gerrit.wikimedia.org/r/115135
[10:46:37] (PS2) Hashar: Describe Math related packages in a class [operations/puppet] - https://gerrit.wikimedia.org/r/115133
[10:47:25] (CR) Physikerwelt: [C: 1] Describe Math related packages in a class [operations/puppet] - https://gerrit.wikimedia.org/r/115133 (owner: Hashar)
[10:47:57] (Abandoned) Physikerwelt: Add texlive-lang-greek [operations/puppet] - https://gerrit.wikimedia.org/r/115102 (owner: Physikerwelt)
[10:47:59] (CR) Hashar: "In production the texvc dependencies are installed via wikimedia-task-appserver package which we can't really install on contint servers " [operations/puppet] - https://gerrit.wikimedia.org/r/115102 (owner: Physikerwelt)
[10:49:49] (CR) Hashar: "ocaml is not included since I do not think we need to push it on all application servers. I have left ocaml in the wikimedia-task-appserve" [operations/puppet] - https://gerrit.wikimedia.org/r/115133 (owner: Hashar)
[10:50:52] (CR) Hashar: "ocaml is left around, I dont think we have to push it to all application servers though." [operations/debs/wikimedia-task-appserver] - https://gerrit.wikimedia.org/r/115135 (owner: Hashar)
[10:55:20] (CR) Hashar: "I have installed the packages manually on both contint servers (gallium and lanthanum) using:" [operations/puppet] - https://gerrit.wikimedia.org/r/115133 (owner: Hashar)
[10:55:44] (CR) Physikerwelt: [C: 1] "I like that. I.e if we move to mathoid we only need to swap out the math class" [operations/debs/wikimedia-task-appserver] - https://gerrit.wikimedia.org/r/115135 (owner: Hashar)
[11:50:52] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Last successful Puppet run was Sat 22 Feb 2014 02:36:40 PM UTC
[11:52:52] akosiaris: so how did you manage to rebuild the archives for february while apparently leaving the older ones intact? I see January has numbers in the 120xxx and February in the 70xxx http://lists.wikimedia.org/pipermail/wikimedia-l/2014-January/
[11:54:03] Nemo_bis: Ι restored up to February, February was not that easy restorable unfortunately.
[11:54:55] I 'd like to understand at some point how those numbers get generated btw
[11:55:39] I'm sure many would
[11:56:04] So you just overwrote the regenerated archives with the backup?
[12:02:43] Nemo_bis: replace, not overwrite but yeah
[12:24:57] bd808|BUFFER: you've got a failing Elasticsearch health check on logstash
[12:31:18] looks like the cluster has a split brain
[12:32:02] you can tell because the health checks logs in icinga don't show the same number of failed shards
[12:55:18] (PS1) Manybubbles: Log Elasticsearch hot_threads [operations/puppet] - https://gerrit.wikimedia.org/r/115151
[13:02:13] (CR) Manybubbles: "I did eventually figure out what was wrong with my puppet tester machine and verify this on it. Filed the startup failure upstream: https" [operations/puppet] - https://gerrit.wikimedia.org/r/115151 (owner: Manybubbles)
[13:25:54] (PS1) Odder: Crats should not add users to import on frwiktionary [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/115152
[13:28:54] (PS1) Odder: Enable web fonts by default on Hebrew Wikisource [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/115153
[13:29:39] (PS1) Alexandros Kosiaris: DHCP configuration for private1-d-eqiad [operations/puppet] - https://gerrit.wikimedia.org/r/115154
[13:30:08] akosiaris: public too?
[13:31:07] paravoid: aren't we gonna do the ulsfo thing ? public IPs being just service ips ?
[13:32:01] ulsfo has a public subnet too
[13:32:12] yeah but no dhcp there
[13:32:55] and we'll probably need some public-facing servers in row D as well?
[13:33:03] why not?
[13:33:14] hmmm no we got.. do we use it is a question
[13:33:20] anyway I will add it for completeness
[13:33:28] yeah we do
[13:33:36] how else would we install e.g. bast4001? :)
[13:33:46] that is the only thing we used it for
[13:33:53] IIRC
[13:33:53] right ?
[13:34:54] probably
[13:36:18] (CR) Nemo bis: [C: 1] "Two users currently in group, they can get them removed any time at [[m:SRP]] if needed." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/115152 (owner: Odder)
[13:38:24] (CR) KartikMistry: [C: 1] "LGTM" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/115153 (owner: Odder)
[14:03:52] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Fri 21 Feb 2014 04:42:42 PM UTC
[14:04:05] (CR) Faidon Liambotis: [C: 2] "LGTM. We really need to template the shit out of these configs, don't we :)" [operations/puppet] - https://gerrit.wikimedia.org/r/115154 (owner: Alexandros Kosiaris)
[14:13:03] (PS1) Hashar: contint: bring elasticsearch for browsertests [operations/puppet] - https://gerrit.wikimedia.org/r/115162
[14:13:39] (CR) jenkins-bot: [V: -1] contint: bring elasticsearch for browsertests [operations/puppet] - https://gerrit.wikimedia.org/r/115162 (owner: Hashar)
[14:15:35] (CR) Manybubbles: contint: bring elasticsearch for browsertests (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/115162 (owner: Hashar)
[14:34:26] (CR) Ottomata: Log Elasticsearch hot_threads (4 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/115151 (owner: Manybubbles)
[14:38:25] (CR) Quentinv57: [C: 1] Crats should not add users to import on frwiktionary [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/115152 (owner: Odder)
[14:50:58] Reedy: Around?
[14:51:52] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Last successful Puppet run was Sat 22 Feb 2014 02:36:40 PM UTC
[15:06:19] (PS2) Manybubbles: contint: bring elasticsearch for browsertests [operations/puppet] - https://gerrit.wikimedia.org/r/115162 (owner: Hashar)
[15:19:51] (PS3) Hashar: contint: bring elasticsearch for browsertests [operations/puppet] - https://gerrit.wikimedia.org/r/115162
[15:22:36] Coren: tests passed :-]
[15:23:15] (CR) coren: [C: 2] "Hashar promises this is good. :-)" [operations/puppet] - https://gerrit.wikimedia.org/r/115162 (owner: Hashar)
[15:24:34] bahh
[15:25:13] Duplicate definition: File[/var/lib/redis] is already defined in file /etc/puppet/manifests/role/ci.pp at line 275; cannot redefine at /etc/puppet/modules/redis/manifests/init.pp:38
[15:25:16] I hate you puppet :]
[15:27:25] (PS1) Hashar: contint: get redis under /mnt/redis [operations/puppet] - https://gerrit.wikimedia.org/r/115174
[15:27:32] Coren: sorry lame follow up https://gerrit.wikimedia.org/r/115174
[15:27:55] Coren: we created a File resource which is already created by the Redis class => Duplicate definition error
[15:34:08] (CR) coren: [C: 2] contint: get redis under /mnt/redis [operations/puppet] - https://gerrit.wikimedia.org/r/115174 (owner: Hashar)
[15:35:44] thanks and sorry Coren :(
[15:36:08] Sorry for what?
[15:38:34] cause there is another one :D
[15:39:01] (PS1) Hashar: contint: require => needs File[], not just the path [operations/puppet] - https://gerrit.wikimedia.org/r/115175
[15:39:34] and I did rearview it :(
[15:39:34] require => '/mnt/elasticsearch'
[15:39:34] does not work :/
[15:39:35] need File[] around it hehe https://gerrit.wikimedia.org/r/115175
[15:40:59] (CR) Amire80: [C: 1] "+1 for the code and the functionality, but over to Ori for performance approval. This will make the Hebrew Wikisource load webfonts by def" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/115153 (owner: Odder)
[15:46:43] Coren: mind getting a File[] error in as well please https://gerrit.wikimedia.org/r/115175 ? :(
[15:47:12] hey uh
[15:47:13] is there a #wikimedia-search or something ?
[15:47:13] hashar: ... I thought that was testes? :-)
[15:47:16] Coren: na only manually reviewed sorry :-(
[15:47:18] missed a bunch :(
[15:49:12] hey hashar
[15:49:13] one dayI will figure out how to compile catalogs on all nodes
[15:49:13] (CR) coren: [C: 2] contint: require => needs File[], not just the path [operations/puppet] - https://gerrit.wikimedia.org/r/115175 (owner: Hashar)
[15:49:13] Antoine is there a wikimedia-search channel ?
[15:49:13] average: you can find them on -dev
[15:49:35] ah ok
[15:49:51] Coren: catalog finally compiled!! :]
[16:12:28] (PS1) Ricordisamoa: localize wmgBabelCategoryNames and wmgBabelMainCategory for oswiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/115181
[16:16:51] !log Restarted elasticsearch on logstash1001
[16:16:58] Logged the message, Master
[16:23:12] (PS2) Faidon Liambotis: Remove streber from DNS (decom) [operations/dns] - https://gerrit.wikimedia.org/r/112669 (owner: Matanya)
[16:23:12] (PS3) Matanya: Remove streber from DNS (decom) [operations/dns] - https://gerrit.wikimedia.org/r/112669
[16:23:16] hrm, grrrit-wm is confused lately
[16:24:19] paravoid: you mean wrong author being reported? there's a bug for that
[16:24:57] paravoid: i'd love if you could get someone to prioritize it, as it affects more than just grrrit-wm
[16:24:59] paravoid: https://bugzilla.wikimedia.org/show_bug.cgi?id=60781
[16:26:04] ^d or qchris would be our best bets
[16:26:51] <^d> bleh gerrit.
[16:38:39] !log Logstash elasticsearch split-brain resulted in loss of all logs for 2014-02-24 from 00:00Z to ~16:30Z
[16:38:47] Logged the message, Master
[16:39:14] oh no
[16:39:30] greg-g: Non-critical service :)
[16:39:37] But yeah not good
[16:40:44] yeah, better now than after we sell people more on depending on it ;)
[16:43:37] http://ganglia.wikimedia.org/latest/?c=Text%20caches%20eqiad&h=cp1067.eqiad.wmnet&m=cpu_report&r=hour&s=by%20name&hc=4&mc=2
[16:43:40] bleh
[17:06:07] (CR) Ricordisamoa: [C: 1] Set wmgBabelCategoryNames for Chinese Wikivoyage [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114970 (owner: Odder)
[17:30:43] (PS1) BryanDavis: logstash: Increase shard replica count [operations/puppet] - https://gerrit.wikimedia.org/r/115204
[17:39:48] (CR) Manybubbles: [C: -1] logstash: Increase shard replica count (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/115204 (owner: BryanDavis)
[17:43:18] (PS2) BryanDavis: logstash: Increase shard replica count [operations/puppet] - https://gerrit.wikimedia.org/r/115204
[17:43:30] (CR) BryanDavis: logstash: Increase shard replica count (1 comment) [operations/puppet] - https://gerrit.wikimedia.org/r/115204 (owner: BryanDavis)
[17:44:23] (CR) Faidon Liambotis: [C: 2] logstash: Increase shard replica count [operations/puppet] - https://gerrit.wikimedia.org/r/115204 (owner: BryanDavis)
[17:44:50] paravoid: Thanks.
[17:45:03] n
[17:45:04] np
[17:52:52] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Last successful Puppet run was Sat 22 Feb 2014 02:36:40 PM UTC
[17:52:55] (CR) Faidon Liambotis: [C: -1] "Minor comment. Apart from that, passing it to Coren :)" (1 comment) [operations/apache-config] - https://gerrit.wikimedia.org/r/109237 (owner: Tim Landscheidt)
[17:53:01] Coren: ^^
[17:54:17] (CR) Faidon Liambotis: "Alex, how do you think we should fix this? Maybe while we're deliberating the monitoring part, we could split this and restore the sysctl " [operations/puppet] - https://gerrit.wikimedia.org/r/111163 (owner: Ori.livneh)
[17:54:26] paravoid: Noted.
[17:54:39] :)
[17:54:51] you can even fix the tab issue yourself and push it if you're feeling up to ti
[17:54:55] *it
[17:55:25] Not while migrating I don't; I'm not going to send that traffic to a webserver I'm messing with. :-)
[17:55:47] it currently serves a "Domain not configured", so I don't think it's any better
[17:55:59] but sure, I don't particularly care about this
[17:56:02] Ah, good point I suppose. They already shut down the ts half?
[17:56:13] I have no idea, which is why I passed the whole thing to you :)
[17:56:17] * Coren chuckles.
[17:56:35] I'll coordinate with TIm.
[18:00:53] PROBLEM - DPKG on virt1001 is CRITICAL: Connection refused by host
[18:01:00] (PS2) Manybubbles: Log Elasticsearch hot_threads [operations/puppet] - https://gerrit.wikimedia.org/r/115151
[18:01:02] PROBLEM - Disk space on virt1001 is CRITICAL: Connection refused by host
[18:01:02] PROBLEM - RAID on virt1001 is CRITICAL: Connection refused by host
[18:01:32] PROBLEM - puppet disabled on virt1001 is CRITICAL: Connection refused by host
[18:01:34] (CR) Manybubbles: Log Elasticsearch hot_threads (4 comments) [operations/puppet] - https://gerrit.wikimedia.org/r/115151 (owner: Manybubbles)
[18:07:02] RECOVERY - Disk space on virt1001 is OK: DISK OK
[18:07:02] RECOVERY - RAID on virt1001 is OK: OK: Active: 16, Working: 16, Failed: 0, Spare: 0
[18:07:32] RECOVERY - puppet disabled on virt1001 is OK: OK
[18:07:52] RECOVERY - DPKG on virt1001 is OK: All packages OK
[18:11:04] (CR) Faidon Liambotis: [C: 1] "LGTM. See inline for some minor stuff." (2 comments) [operations/puppet/kafkatee] - https://gerrit.wikimedia.org/r/110650 (owner: Ottomata)
[18:16:25] (CR) Matanya: [C: 1] "Ah! thanks for clarifying this, good job." [operations/puppet/kafkatee] - https://gerrit.wikimedia.org/r/110650 (owner: Ottomata)
[18:23:33] Coren: the certificate expiration alert is still there; I recall you fighting it, any luck?
[18:25:09] paravoid: I had to roll back my dirty fix because it broke other things. I'm sure I could fix it given the time but, to be honest, right now I have a set of more pressing demands on my time. :-/ I'll take another wack at it next time I'm stuck waiting on something.
[18:25:48] (CR) Ottomata: [C: 1] "U no like elasticsearch::log::hot_threads or even elasticsearch::hot_threads_log? Either of those are fine. ::hot_threads is fine with m" [operations/puppet] - https://gerrit.wikimedia.org/r/115151 (owner: Manybubbles)
[18:26:11] (PS3) Ottomata: Kafkatee puppet module [operations/puppet/kafkatee] - https://gerrit.wikimedia.org/r/110650
[18:26:59] (CR) Manybubbles: "Huh?" [operations/puppet] - https://gerrit.wikimedia.org/r/115151 (owner: Manybubbles)
[18:27:14] manybubbles:
[18:27:23] you changed it to elasticsearch::hot_threads, right?
[18:27:31] now the class name doesn't say log at all
[18:27:38] but that's ok by me if that's what you really wanted!
[18:28:22] Coren: that certificate should have been revoked, no? and if so, how is it trusted by the machines that use LDAP?
[18:29:01] paravoid: They have the cert itself as explicit trust, and I doubt they check the CRLs
[18:29:14] nice...
[18:29:23] (PS3) Manybubbles: Log Elasticsearch hot_threads [operations/puppet] - https://gerrit.wikimedia.org/r/115151
[18:29:36] so we did revoke a possibly compromised certificate but still treat it as trusted internally to carry all of our passwords
[18:29:39] Most SSL clients are implemented by blind idiots.
[18:30:49] Actually, I'd wager that most of the clients don't check for certificate validity at all.
[18:32:23] (PS4) Manybubbles: Log Elasticsearch hot_threads [operations/puppet] - https://gerrit.wikimedia.org/r/115151
[18:36:24] (CR) Ottomata: [C: 2 V: 2] Kafkatee puppet module [operations/puppet/kafkatee] - https://gerrit.wikimedia.org/r/110650 (owner: Ottomata)
[18:39:46] (PS1) Faidon Liambotis: check_cert: catch SSLErrors and print them as such [operations/puppet] - https://gerrit.wikimedia.org/r/115211
[18:40:27] (CR) Faidon Liambotis: [C: 2 V: 2] check_cert: catch SSLErrors and print them as such [operations/puppet] - https://gerrit.wikimedia.org/r/115211 (owner: Faidon Liambotis)
[18:41:24] manybubbles: :) but also uhrm, more annoying thing
[18:41:25] the class name needs to match the file name
[18:41:25] so if we do ::log::hot_threads
[18:41:25] the file needs to be at
[18:41:32] PROBLEM - Certificate expiration on virt0 is CRITICAL: SSL error: [Errno 1] _ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
[18:41:52] PROBLEM - Certificate expiration on virt1000 is CRITICAL: SSL error: [Errno 1] _ssl.c:504: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed
[18:44:34] Coren: I'm confused, I don't see any related puppet changes from you
[18:44:51] you do know that we do all the certificate handling via puppet, right?
[18:46:05] paravoid: Ostensibly, yes. There isn't a puppet way to handle changing the opendj configuration to fiddle with keystores though; so all my fixes were done live to first attempt to reliably determine the right method.
[18:47:48] paravoid: The only thing you can do in puppet atm is general a PCKS12 certificate/key pair which the LDAP server will then ignore. :-)
[18:48:03] sounds like *that* needs fixing
[18:48:30] It does.
[18:49:01] it'd be cool if you gave it a look soon-ish though; it's a real problem
[18:49:16] also, 17 out of 26 service alerts in icinga are labs-related :)
[18:50:49] paravoid: *sigh* I know, dude. And I agree with you 200%. But trying to move labs to eqiad without using /too much/ duct tape and bailing wire is draining what little time both me and Andrew got.
[18:51:49] paravoid: Once I got tools running there; pressure is going to release a bit and I'll have a better chance to hack at backlog.
[18:52:00] FWIW, it's going reasonably well.
[19:06:59] (PS1) Jforrester: Make VisualEditor opt-out on Portuguese Wikibooks [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/115213
[19:07:33] (CR) Catrope: [C: 1] Make VisualEditor opt-out on Portuguese Wikibooks [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/115213 (owner: Jforrester)
[19:18:13] ottomata: busy?
[19:18:37] in meeting, ja
[19:18:40] ottomata: didn't I move it?
[19:19:01] please ping me later if you find time ottomata
[19:29:58] !log remmoting virt1001 (sick stuck on bad mounts)
[19:30:05] Logged the message, Master
[19:31:22] PROBLEM - Host virt1001 is DOWN: PING CRITICAL - Packet loss = 100%
[19:33:21] (CR) Merlijn van Deen: "The domain is owned by WMNL, and they are also handling the DNS. If, maintenance-wise, it's preferrable to have DNS here, I can ask them t" [operations/apache-config] - https://gerrit.wikimedia.org/r/109237 (owner: Tim Landscheidt)
[19:34:42] RECOVERY - Host virt1001 is UP: PING OK - Packet loss = 0%, RTA = 1.31 ms
[19:37:58] (CR) Hashar: "Cant we point pywikibot.org directly to tools labs ? :-D" [operations/apache-config] - https://gerrit.wikimedia.org/r/109237 (owner: Tim Landscheidt)
[19:42:15] (CR) Tim Landscheidt: "@hashar: That would require a more complex configuration; also, soon(TM), tools.wmflabs.org will be handled by dynamic proxy (?), so this " [operations/apache-config] - https://gerrit.wikimedia.org/r/109237 (owner: Tim Landscheidt)
[20:01:36] (PS1) Dereckson: Add skipcaptcha right for all sysops on labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/115230
[20:05:52] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Fri 21 Feb 2014 04:42:42 PM UTC
[20:11:01] (CR) Steinsplitter: [C: 1] Add skipcaptcha right for all sysops on labs [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/115230 (owner: Dereckson)
[20:21:00] (PS1) Dzahn: remove db9, decom [operations/dns] - https://gerrit.wikimedia.org/r/115241
[20:27:10] mutante: did you pick at the extra cron of bugzilla reporter?
[20:27:19] matanya: yes, done
[20:27:33] thanks
[20:28:03] 6894
[20:28:58] (PS1) Dzahn: remove kaulen, decom [operations/dns] - https://gerrit.wikimedia.org/r/115243
[20:29:21] (CR) Dzahn: [C: -1] "let me just copy some old data before" [operations/dns] - https://gerrit.wikimedia.org/r/115243 (owner: Dzahn)
[20:29:54] matanya: you'll notice stat1 user removals are merged..but maybe the ticket isnt updated yet
[20:31:18] manybubbles: want me to merge hot_threads?
[20:31:26] fine with me!
[20:31:51] (PS5) Manybubbles: Log Elasticsearch hot_threads [operations/puppet] - https://gerrit.wikimedia.org/r/115151
[20:31:51] i have mutante stuff had been hectic @ $day_job lately
[20:31:56] (CR) Ottomata: [C: 2 V: 2] Log Elasticsearch hot_threads [operations/puppet] - https://gerrit.wikimedia.org/r/115151 (owner: Manybubbles)
[20:32:07] i will update later, hope
[20:32:32] matanya: no worries, heh, actual job is first!:)
[20:33:13] (PS1) Dzahn: remove kaulen from puppet,dsh,dhcp [operations/puppet] - https://gerrit.wikimedia.org/r/115246
[20:33:31] !log Restarted elasticsearch on logstash1002 in attempt to clear stuck reallocations likely caused by OOM while running recovery
[20:33:40] Logged the message, Master
[20:34:32] k done, manybubbles
[20:34:32] :)
[20:34:37] thanks!
[20:39:33] MatmaRex, paravoid: The fix for https://bugzilla.wikimedia.org/show_bug.cgi?id=60781 is basically ready. Want to test a few remaining edge cases and then bribe ^d to deploy it.
[20:39:37] mutante: ticket updated
[20:40:44] qchris: <3
[20:53:52] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Last successful Puppet run was Sat 22 Feb 2014 02:36:40 PM UTC
[21:00:02] RECOVERY - ElasticSearch health check on logstash1001 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 36: active_shards: 72: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[21:00:22] RECOVERY - ElasticSearch health check on logstash1002 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 36: active_shards: 72: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[21:00:32] RECOVERY - ElasticSearch health check on logstash1003 is OK: OK - elasticsearch (production-logstash-eqiad) is running. status: green: timed_out: false: number_of_nodes: 3: number_of_data_nodes: 3: active_primary_shards: 36: active_shards: 72: relocating_shards: 2: initializing_shards: 0: unassigned_shards: 0
[21:03:49] starting the parsoid deploy..
[21:17:41] !log restarting parsoid on wtp1002
[21:17:49] Logged the message, Master
[21:18:03] (PS1) RobH: techblog.wikimedia.org to get its own certificate [operations/puppet] - https://gerrit.wikimedia.org/r/115298
[21:19:21] (PS2) RobH: techblog.wikimedia.org to get its own certificate [operations/puppet] - https://gerrit.wikimedia.org/r/115298
[21:19:37] greg-g: when parsoid is done, can someone help deploy our update? (seems reedy is not around)
[21:19:40] mutante: sharing knowledge: i had to set a user today with access to a system, but he was not allowed to get an interactive shell, and had to run two service commands (start/stop). how would you do that? :)
[21:20:07] would be great to have opportunity to poke around on test.wikidata again tomorrow to confirm all issues are fixed
[21:20:19] ^d: can you deploy a wikidata update to test.wikidata?
[21:20:26] (he's not at his desk right now)
[21:20:31] sure, thanks
[21:20:36] (PS1) Jforrester: Enable VE in the "Recherche:" (104) namespace for frwikiversity [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/115299
[21:20:42] I need to go run a quick errand, brb
[21:20:50] * ^d is in meeting
[21:20:53] k
[21:22:16] https://gerrit.wikimedia.org/r/#/c/115208/ for reference
[21:22:30] (CR) RobH: [C: 2] techblog.wikimedia.org to get its own certificate [operations/puppet] - https://gerrit.wikimedia.org/r/115298 (owner: RobH)
[21:23:57] !log deployed Parsoid 51c71eb / deploy b684fea
[21:24:04] Logged the message, Master
[21:24:07] aude, we are done
[21:24:32] gwicke: great
[21:24:57] !log updating blog apache configs to use techblog.w.o https cert
[21:25:06] Logged the message, RobH
[21:26:38] !log techblog.w.o redirects now work without certificate errors
[21:26:46] Logged the message, RobH
[21:28:58] anybody around with knowledge of how service-restart works?
[21:29:24] we got an early timeout from salt during the parsoid deploy, which seems to be a salt misconfiguration
[21:31:22] gwicke: i just found there is also
[21:31:24] salt.modules.upstart.force_reload(name)
[21:31:38] http://salt.readthedocs.org/en/latest/ref/modules/all/salt.modules.upstart.html
[21:31:54] I see cmd = ("sudo salt-call -l quiet --out json publish.runner "
[21:31:54] "deploy.restart '{0}','{1}'")
[21:32:04] in /usr/local/bin/service-restart
[21:32:05] on tin
[21:34:59] from http://docs.saltstack.com/ref/cli/salt.html: -t TIMEOUT, default 5
[21:35:46] I think I got the timeout around 5 seconds
[21:36:09] parsoid needs about 90 seconds at most for a graceful restart
[21:44:24] !log kaulen - stopping services, disabling monitoring
[21:44:32] Logged the message, Master
[21:44:52] * ^d responds to an aude and greg-g ping.
[21:44:54] <^d> Deploy?
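Editor's note on the salt timeout discussed at 21:29-21:36: the mismatch between a roughly 5 second caller timeout and a restart that can take up to about 90 seconds can be illustrated with a minimal Python sketch. This is not how service-restart or salt is implemented; `run_with_timeout` and `fake_restart` are hypothetical helpers, and the child process is only a stand-in that sleeps. Note also that `subprocess.run` kills the child when the timeout expires, which may differ from what a remote job runner does when its caller gives up.

```python
import subprocess
import sys


def run_with_timeout(argv, timeout_s):
    """Run argv and report 'ok', 'failed', or 'timed out' after timeout_s seconds."""
    try:
        result = subprocess.run(argv, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The caller gave up before the command finished.
        return "timed out"
    return "ok" if result.returncode == 0 else "failed"


def fake_restart(seconds):
    """Hypothetical stand-in for a slow graceful restart: a child that just sleeps."""
    return [sys.executable, "-c", "import time; time.sleep(%s)" % seconds]
```

With these helpers, a call such as `run_with_timeout(fake_restart(90), 5)` reports a timeout even though the restart itself would have succeeded, which matches the symptom described in the channel; a caller that waits longer than the slowest expected restart avoids the false alarm.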
[21:45:04] yep
[21:45:20] ^d: https://gerrit.wikimedia.org/r/115208
[21:45:29] (PS6) Jforrester: Enable Parsoid's edit caching on all public wikis [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114100
[21:45:30] ^d: yep
[21:45:35] <^d> Just wmf15?
[21:45:37] fixes test.wikidata
[21:46:09] (CR) Jforrester: "PS6 has the commit message updated to list the most significant change. Parsoid consider this good to go." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/114100 (owner: Jforrester)
[21:46:23] * ^d +2s and waits for jenkins
[21:46:30] ok
[21:48:26] (CR) Dzahn: [C: 2] remove kaulen from puppet,dsh,dhcp [operations/puppet] - https://gerrit.wikimedia.org/r/115246 (owner: Dzahn)
[21:49:43] !log aaron synchronized php-1.23wmf15/includes/filebackend 'efb1e99fdf1e91bef4fc086b945b7933049e2a50'
[21:49:52] Logged the message, Master
[21:50:20] !log aaron synchronized php-1.23wmf15/includes/filebackend 'd52a8af6c2f730d017e87e7217b6a0b299ab85be'
[21:50:28] Logged the message, Master
[21:51:46] !log demon synchronized php-1.23wmf15/extensions/Wikidata 'Updating wikidata build to fix test.wikidata'
[21:51:54] Logged the message, Master
[21:51:58] <^d> aude, hoo: ^
[21:52:03] :)
[21:52:06] thanks
[21:52:12] * aude checks test.wikidata
[21:53:09] looks good
[21:58:06] !log updated blog.w.o to wp3.8.1
[21:58:13] Logged the message, RobH
[22:00:09] mutante: for the sake of completence - i used ssh force command
[22:01:01] matanya: for what
[22:02:02] mutante: sharing knowledge: i had to set a user today with access to a system, but he was not allowed to get an interactive shell, and had to run two service commands (start/stop). how would you do that? :)
[22:02:38] matanya, ssh key with forced command
[22:02:41] * hoo knows that prob.
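Editor's note on the "ssh key with forced command" answer at 22:00-22:02: a minimal sketch of the idea, assuming OpenSSH's `command="..."` option in `authorized_keys` and the `SSH_ORIGINAL_COMMAND` environment variable that sshd sets for forced commands. The service name `myservice` and the script path `/usr/local/bin/service-gate` are hypothetical, and this is not the setup matanya actually used.

```python
import os

# Hypothetical service name; only these two requests are ever honoured.
SERVICE = "myservice"
ALLOWED = {
    "start": ["service", SERVICE, "start"],
    "stop": ["service", SERVICE, "stop"],
}


def resolve(original_command):
    """Map the command the client asked for to an allowed argv, or None."""
    requested = (original_command or "").strip()
    return ALLOWED.get(requested)


def authorized_keys_line(public_key, script="/usr/local/bin/service-gate"):
    """Build an authorized_keys entry that forces every login through script."""
    options = [
        'command="%s"' % script,   # run this instead of a shell
        "no-pty",                  # no interactive terminal
        "no-port-forwarding",
        "no-agent-forwarding",
        "no-X11-forwarding",
    ]
    return ",".join(options) + " " + public_key


def main():
    # sshd exports the client's requested command in SSH_ORIGINAL_COMMAND.
    argv = resolve(os.environ.get("SSH_ORIGINAL_COMMAND"))
    if argv is None:
        raise SystemExit("only 'start' and 'stop' are permitted")
    os.execvp(argv[0], argv)
```

Deployed as the forced-command script, the file would end with a call to `main()`; the user then runs `ssh host start` or `ssh host stop` and anything else is rejected. Matching against a fixed table, rather than passing the requested string to a shell, is the design choice that keeps the restriction from being bypassed.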
[22:03:03] yeah, see above :) took me like 10 minutes to think of it [22:03:15] * Platonides notes that's what matanya said three lines above [22:04:14] root@deployment-cache-bits03:/home/maxsem# varnishadm ban.url . && varnishadm -n frontend ban.url . [22:04:14] Cannot open /var/lib/varnish/frontend/_.vsm: No such file or directory [22:04:14] Could not open shared memory [22:04:22] what's wrong^^? [22:05:03] matanya: aha, thanks for sharing [22:07:07] * gwicke created https://bugzilla.wikimedia.org/show_bug.cgi?id=61882 for the salt timeout issue mentioned above [22:09:33] Anybody have time to do a quick config tweak? https://gerrit.wikimedia.org/r/#/c/115094/ [22:09:42] ^ fixes size of video player popup [22:44:30] (PS1) RobH: icinga-admin to use own cert, not wildcard [operations/puppet] - https://gerrit.wikimedia.org/r/115315 [22:46:41] (PS2) RobH: icinga-admin to use own cert, not wildcard [operations/puppet] - https://gerrit.wikimedia.org/r/115315 [22:48:09] (CR) Tim Landscheidt: Redirect pywikipedia.org to Tools (1 comment) [operations/apache-config] - https://gerrit.wikimedia.org/r/109237 (owner: Tim Landscheidt) [22:48:27] (PS3) Tim Landscheidt: Redirect pywikipedia.org to Tools [operations/apache-config] - https://gerrit.wikimedia.org/r/109237 [22:48:53] (CR) RobH: [C: 2] icinga-admin to use own cert, not wildcard [operations/puppet] - https://gerrit.wikimedia.org/r/115315 (owner: RobH) [22:50:01] !log kaulen - shutdown -h now [22:50:08] Logged the message, Master [22:52:08] Kaulen is no more... [22:55:57] !log kaulen - revoke puppet cert, revoke salt key, stored configs,... [22:56:00] :-( [22:56:04] Logged the message, Master [22:56:13] mutante: congratulations on a job well done! [22:56:14] hashar: did you want it? [22:56:23] hashar: thank you: [22:56:25] mutante: na just being sentimental about it :D [22:56:48] maybe the wmf museum could use it [22:57:19] heh :P What hardware is it?
[22:57:31] One of the Poweredge 1950s? [22:59:35] i think you missed the better one for being sentimental [22:59:40] locke :) [22:59:54] which used to be db3 or something even before that [23:00:09] locke did a bunch of shit [23:00:12] HW type:Dell PowerEdge R300 [23:00:13] but was dead when i started [23:00:16] just sittin in rack,heh [23:00:18] kaulen wasn't even _that_ old [23:00:23] nope [23:00:28] compared to locke or db9 [23:00:34] I guess people will cry about fenari :P [23:00:35] db9 can also go, btw [23:00:41] also a r300 [23:01:00] !log icinga and icinga-admin now using their own certs [23:01:07] Logged the message, RobH [23:01:22] 23:42 mutante: shutting down locke - killing 757 days of uptime and one more Tampa classic host [23:01:25] hoo: we cried after Zwinger :-} fenari is already replaced by tin nowadays anyway :D [23:01:27] ^ that was more uptime for sure:) [23:01:44] hashar: Yeah... but it's still there ... :P [23:02:04] hashar: not ...ehmm.. really.. ehm.. [23:02:10] locke, page created by Robh on 19th December 2006 https://wikitech.wikimedia.org/w/index.php?title=Locke&action=history [23:02:29] hashar: noc/sync-apache/dsh groups/ ... [23:02:39] hashar: doesnt mean much [23:02:53] RobH: it is surely at least that old! [23:03:10] yea, it was around before i started [23:03:19] it was already a dead server on my first day [23:03:29] that page history is broken [23:03:34] as it was some other page before then [23:03:34] :-) [23:03:40] odd hisotyr [23:03:42] history even [23:03:45] because it used to be db6 [23:03:50] https://racktables.wikimedia.org/index.php?page=object&object_id=426 [23:03:54] thats wrong dude [23:03:59] it was never db6 [23:04:09] db6 was a 2950 [23:04:20] locke was some old non dell custom build server [23:04:26] that was missing disk trays and a pdu [23:04:38] no clue why racktables thinks that [23:04:49] oh maybe im misrecalling [23:05:00] if you shut something down im misrecalling is all [23:05:02] disregard.
[23:05:04] anyway bed time for me [23:05:15] what the hell was the machine i was gonna donate to tim.. [23:05:18] i no longer recall [23:05:45] :D What will he use it for? As a compile box? [23:05:57] PROBLEM - Puppet freshness on virt1000 is CRITICAL: Last successful Puppet run was Fri 21 Feb 2014 04:42:42 PM UTC [23:16:55] (PS1) RobH: ishmael.wikimedia.org to use its own cert, not wildcard [operations/puppet] - https://gerrit.wikimedia.org/r/115318 [23:18:21] (CR) RobH: [C: 2] ishmael.wikimedia.org to use its own cert, not wildcard [operations/puppet] - https://gerrit.wikimedia.org/r/115318 (owner: RobH) [23:18:55] !log updating ishmael to use its own ssl cert [23:19:03] Logged the message, RobH [23:29:07] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (201183) [23:32:45] huh [23:32:53] wtf does one do to fix that job queue issue? [23:33:29] depends :P [23:33:30] !log all neon services no longer using wildcard, and wildcard shredded off system [23:33:38] Logged the message, RobH [23:33:46] yea, gotta see what jobs.. checking out the wikitech docs [23:34:15] RobH: heh... I'll also have quick look 00:30 over here, though [23:35:08] RECOVERY - check_job_queue on terbium is OK: JOBQUEUE OK - all job queues below 200,000 [23:35:15] (CR) Dzahn: [C: 2] remove kaulen, decom [operations/dns] - https://gerrit.wikimedia.org/r/115243 (owner: Dzahn) [23:35:54] !log DNS update - removing kaulen [23:35:58] or it can fix itself. [23:36:02] Logged the message, Master [23:36:07] RobH: Or that :P [23:36:18] mutante: just dont remove mgmt dns until its wiped!
[23:37:18] RobH: i had discussions about that with cmjohnson and also Jeff he says they don't need DNS for wiping [23:37:27] i also thought that earlier [23:37:29] nah, they dont to do it remotely [23:37:32] but i changed it after talking to them [23:37:36] to do so NON remotely [23:37:37] yea, that [23:37:41] but we have issues with hands on in tampa [23:37:49] so its easier to leave mgmt for remote wipe imo. [23:39:55] but doesnt really matter [23:40:46] it's ok, then we just need another step to remove mgmt later [23:40:48] 14:02 not even mgmt needed [23:40:48] 14:03 yeah apparently they do it at the console instead of pxe-style [23:41:40] i'm just going to add reviewers and not self-merge anymore, solved [23:43:07] PROBLEM - check_job_queue on terbium is CRITICAL: JOBQUEUE CRITICAL - the following wikis have more than 199,999 jobs: , Total (200631) [23:53:57] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Last successful Puppet run was Sat 22 Feb 2014 02:36:40 PM UTC