[10:05:09] New review: Dzahn; ""ploticus" plotting lib for stat1" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/3512
[10:05:12] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3512
[10:13:03] New patchset: Mark Bergsma; "Temporarily place bits.pmtpa behind bits.eqiad to test if sess leakage occurs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4996
[10:13:19] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4996
[10:13:56] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4996
[10:13:59] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4996
[10:18:41] New patchset: Mark Bergsma; "Stop doing translations on tier2 pmtpa hosts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4998
[10:19:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4998
[10:19:10] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4998
[10:19:13] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4998
[10:33:04] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours
[10:35:36] New patchset: Mark Bergsma; "Make pmtpa equal to esams" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5000
[10:35:52] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5000
[10:35:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5000
[10:35:55] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5000
[10:39:43] 5000th!
wee
[10:53:10] PROBLEM - Puppet freshness on nfs2 is CRITICAL: Puppet has not run in the last 10 hours
[10:57:04] PROBLEM - Puppet freshness on nfs1 is CRITICAL: Puppet has not run in the last 10 hours
[10:57:04] PROBLEM - Puppet freshness on sq34 is CRITICAL: Puppet has not run in the last 10 hours
[11:10:32] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[11:13:08] Reedy: you know a bit about refreshLinks.php ?
[11:14:06] like performance impact when running it (eventually via cron on all wikis)
[11:21:49] !running authdns-update to add textbook.wp entry
[11:30:33] mutante: it shouldn't be too bad when doing it with --dfn-only
[11:31:17] I suspect it should be run at 1 script per cluster
[11:31:25] Reedy: looks like we are supposed to run a test of it together
[11:31:37] lol
[11:31:58] heh, just from the ticket :)
[11:32:13] * Reedy tests on mww
[11:32:23] real 0m11.913s
[11:32:44] mutante: can you look into why puppet is not running on all these hosts?
[11:33:03] mark: ok
[11:37:34] !log nfs1 - Could not find class misc::mediawiki-logger for nfs1
[11:37:36] Logged the message, Master
[11:41:17] PROBLEM - Host sq34 is DOWN: PING CRITICAL - Packet loss = 100%
[11:42:20] PROBLEM - Router interfaces on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[11:42:38] PROBLEM - Router interfaces on mr1-pmtpa is CRITICAL: CRITICAL: No response from remote host 10.1.2.3 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[11:42:38] PROBLEM - BGP status on cr2-pmtpa is CRITICAL: CRITICAL: No response from remote host 208.80.152.197,
[11:42:56] PROBLEM - BGP status on cr2-eqiad is CRITICAL: CRITICAL: No response from remote host 208.80.154.197,
[11:43:59] PROBLEM - HTTP on fenari is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:44:08] RECOVERY - BGP status on cr2-pmtpa is OK: OK: host 208.80.152.197, sessions up: 7, down: 0, shutdown: 0
[11:44:17] RECOVERY - BGP status on cr2-eqiad is OK: OK: host 208.80.154.197, sessions up: 19, down: 0, shutdown: 1
[11:44:22] !log sq34 was broken and died when connecting to mgmt, powercycling
[11:44:24] Logged the message, Master
[11:45:11] RECOVERY - Router interfaces on cr2-pmtpa is OK: OK: host 208.80.152.197, interfaces up: 89, down: 0, dormant: 0, excluded: 0, unused: 0
[11:45:19] Have I upset NFS/fenari again?
[11:45:20] RECOVERY - HTTP on fenari is OK: HTTP OK HTTP/1.1 200 OK - 4416 bytes in 0.002 seconds
[11:45:44] Back again
[11:46:00] you've saturated the uplink of rack A4 again
[11:46:03] we should really fix that
[11:46:52] It seems it didn't copy the symlink, it copied a whole copy of the directory
[11:46:53] ffs
[11:48:44] !log sq34 - System halted! Error: Internal Storage Slot, powered down, -> RT
[11:48:46] Logged the message, Master
[11:51:41] ACKNOWLEDGEMENT - Host sq34 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn RT #2823
[11:53:08] mark: you already know the reason for puppet runs on amslvs, right
[11:53:19] no
[11:53:27] what is it?
[11:53:28] Failed to parse template pybal/pybal.conf.erb:
[11:53:48] undefined method `sort' for nil:NilClass at /var/lib/git/operations/puppet/manifests/lvs.pp:506
[11:54:20] the object in pybal.conf.erb which it sorts is not defined (nil)
[11:54:29] can you figure out why?
[11:55:16] i can try ..
[11:56:08] RECOVERY - check_job_queue on spence is OK: JOBQUEUE OK - all job queues below 10,000
[11:57:11] RECOVERY - check_job_queue on neon is OK: JOBQUEUE OK - all job queues below 10,000
[12:04:42] New patchset: Mark Bergsma; "Call SES_Delete on sessions found to be closed during vca_return_session." [operations/debs/varnish] (patches/sess_leak_fix2) - https://gerrit.wikimedia.org/r/5005
[12:05:23] New review: Mark Bergsma; "(no comment)" [operations/debs/varnish] (patches/sess_leak_fix2); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5005
[12:05:26] Change merged: Mark Bergsma; [operations/debs/varnish] (patches/sess_leak_fix2) - https://gerrit.wikimedia.org/r/5005
[12:06:32] !log Testing sess_leak_fix2 patch with a snapshot varnish build on cp3001
[12:06:35] Logged the message, Master
[12:24:42] !log Sending European bits traffic back to esams
[12:24:44] Logged the message, Master
[12:46:46] PROBLEM - Packetloss_Average on emery is CRITICAL: CRITICAL: packet_loss_average is 9.56517362205 (gt 8.0)
[13:22:07] PROBLEM - Puppet freshness on gilman is CRITICAL: Puppet has not run in the last 10 hours
[13:22:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:24:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.553 seconds
[13:27:58] !log Sending European bits traffic back to pmtpa
[13:28:00] Logged the message, Master
[13:58:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:00:49] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.873 seconds
[14:33:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:37:35] mark: i want to work on db24 dimm issue can you bring down for mem test
[14:37:39] !rt 2678
[14:37:39] https://rt.wikimedia.org/Ticket/Display.html?id=2678
[14:39:31] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 7.410 seconds
[14:40:45] shutting down mysql
[14:45:22] PROBLEM - mysqld processes on db24 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[14:46:23] cmjohnson1: going down
[14:46:31] !log Shutdown db24 for memory testing by Chris
[14:46:33] great....thx
[14:46:34] Logged the message, Master
[14:48:31] PROBLEM - Host db24 is DOWN: PING CRITICAL - Packet loss = 100%
[14:56:26] New patchset: Dzahn; "logging.pp - update class name in system_role, remove nonexistent misc::mediawiki-logger from nodes nfs[12]" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5026
[14:56:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5026
[15:01:33] New review: Dzahn; "removing this from nfs[12] because the class does not exist anymore so puppet breaks. it may need th..." [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5026
[15:01:36] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5026
[15:03:22] RECOVERY - Puppet freshness on nfs1 is OK: puppet ran at Mon Apr 16 15:03:02 UTC 2012
[15:04:25] RECOVERY - Puppet freshness on nfs2 is OK: puppet ran at Mon Apr 16 15:04:10 UTC 2012
[15:04:35] !log puppet fresh on nfs[12] after removing nonexistent misc::mediawiki-logger class
[15:04:37] Logged the message, Master
[15:06:43] mutante: https://rt.wikimedia.org/Ticket/Display.html?id=2355 -- could you and Reedy schedule something here?
[15:06:57] Oh
[15:06:59] I forgot about that
[15:07:20] Hmm
[15:07:22] mutante: real 165m47.429s
[15:07:28] ^ That's how long it took to run on commons
[15:07:39] hexmode: eh yeah, we talked a bit earlier, see above
[15:07:39] Though it died
[15:07:40] Error in fetchObject(): Lost connection to MySQL server during query (10.0.6.32)
[15:07:50] * Reedy runs again
[15:07:53] mutante: above how far?
[15:07:57] the time output?
[15:08:05] 4 hours or so
[15:08:05] yes
[15:08:40] ohhh does refresh links get the imagelinks table too?
[15:09:23] i understand it as: the task is "add cronjob in puppet, but make it NOT run at the same time on all wikis"
[15:10:19] and then there was "foreachwiki" vs. "mwscriptwikiset" to execute it
[15:10:23] yeah
[15:10:46] Like I said, we can probably get away with spawning it foreachwiki in cluster
[15:11:16] It hits 9 different tables
[15:11:41] SELECT DISTINCT( $field ) FROM $table LEFT JOIN page ON $field=page_id WHERE page_id IS NULL;
[15:11:55] Reedy, mutante: could you update bugzilla: https://bugzilla.wikimedia.org/show_bug.cgi?id=16112
[15:12:50] 690891291 | Using index; Using temporary
[15:12:56] oohhh it does! yay
[15:12:59] 1 | Using where; Using index; Not exists; Distinct
[15:14:04] so to get the basics together: which host would the (puppet) cron run on then?
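The staggered-cron scheme being worked out above (one refreshLinks.php run per DB cluster via the mwscriptwikiset wrapper, never all wikis at once) can be sketched as follows. The cluster names (s1..s8), the dblist filenames, and the 3-hour stagger are illustrative assumptions, not the configuration that was actually deployed:

```shell
# Hedged sketch: emit one staggered crontab line per DB cluster, each running
# refreshLinks.php --dfn-only over that cluster's dblist via mwscriptwikiset.
# Cluster count, dblist names, and stagger interval are assumptions.
for i in 1 2 3 4 5 6 7 8; do
  hour=$(( (i - 1) * 3 % 24 ))   # spread start times 3 hours apart
  printf '0 %d * * 0 mwscriptwikiset refreshLinks.php s%d.dblist --dfn-only\n' \
    "$hour" "$i"
done
```

Each emitted line is a standard crontab entry (minute, hour, day-of-month, month, day-of-week), so the runs land on the same weekday but at different hours, keeping any single DB cluster from seeing more than one sweep at a time.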
[15:14:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:14:14] Most likely hume
[15:15:35] and we want several cronjobs at different times, calling the same mwscriptwikiset command, but with varying "dblist" files
[15:17:03] I suggested doing it per db cluster it's mainly DB server intensive
[15:17:08] *1 per db cluster
[15:17:44] was at http://wikitech.wikimedia.org/view/Heterogeneous_deployment#Run_a_maintenance_script_on_a_group_of_wikis
[15:18:36] yeah, exactly
[15:18:40] we've dblists for each
[15:18:54] ok
[15:20:08] so literally you need 8 cron entries doing mwscriptwikiset refreshLinks.php sX.dblist --dfn-only
[15:20:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 9.819 seconds
[15:22:19] hexmode: Reedy: ok, i'll update tickets once i came up with a puppet change
[15:23:35] Thanks
[15:52:00] New patchset: Reedy; "Link rXXXXX to CodeReview" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5033
[15:52:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5033
[15:53:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:57:19] New review: Demon; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/5033
[16:02:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.035 seconds
[16:03:24] New patchset: Reedy; "Link to RT tickets also!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5034
[16:03:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5034
[16:04:01] New patchset: Pyoungmeister; "forgot nfs1/2 :/" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5035
[16:04:17] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5035
[16:07:30] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5035
[16:07:33] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5035
[16:07:44] New review: Demon; "Just amend the other change?" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/5034
[16:07:53] !log upgrading and restarting udp2log on nfs1/2
[16:07:55] Logged the message, notpeter
[16:10:49] !erb is to check the syntax of a puppet erb template: erb -x -T '-' mytemplate.erb | ruby -c
[16:10:49] Key was added!
[16:17:41] mark: what is the correct way to start demux.py?
[16:18:02] on nfs1/2
[16:18:09] I'm not finding anything close to documentation....
[16:19:17] no idea
[16:19:20] k
[16:20:13] New patchset: Reedy; "Link rXXXXX to CodeReview" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5033
[16:20:30] Change abandoned: Reedy; "Merged with 5033" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5034
[16:20:30] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5033
[16:21:20] New review: Demon; "With the RT tickets, we probably want to put word boundaries on them. Otherwise you end up linking t..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/5033
[16:23:04] New patchset: Reedy; "Link rXXXXX to CodeReview" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5033
[16:23:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5033
[16:33:04] New patchset: Lcarr; "fixing puppetmaster templating" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5103
[16:33:20] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5103
[16:34:37] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:35:00] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5103
[16:35:04] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5103
[16:41:40] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.029 seconds
[16:46:13] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3291
[16:55:57] New patchset: preilly; "Add ACL for carriers and redirect support for carriers landing page" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4916
[16:56:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/4916
[16:57:40] New review: preilly; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/4916
[16:59:31] PROBLEM - Varnish HTTP bits on cp3001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:01:29] New patchset: Dzahn; "class for mw cronjobs to run refreshLinks.php per cluster" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5104
[17:01:42] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/5104
[17:02:09] LeslieCarr: take a look at: https://gerrit.wikimedia.org/r/#change,4916
[17:03:01] Reedy: what do you know about udp2log?
[17:03:17] New patchset: Dzahn; "class for mw cronjobs to run refreshLinks.php per cluster - fix var name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5104
[17:03:31] New review: gerrit2; "Change did not pass lint check. You will need to send an amended patchset for this (see: https://lab..." [operations/puppet] (production); V: -1 - https://gerrit.wikimedia.org/r/5104
[17:03:43] RECOVERY - Varnish HTTP bits on cp3001 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.218 seconds
[17:04:00] notpeter: A few bits.. What are you going to ask me? :p
[17:04:08] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/4916
[17:04:11] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4916
[17:04:40] I just want to know what "flush" does in the config
[17:05:11] I upgraded to 1.8 on nfs1
[17:05:33] and demux.py started silently dying (or never launching) thus making it useless
[17:05:44] when I removed "flush" from the conf line
[17:05:46] it started working
[17:05:49] and the bits are flowing
[17:05:58] but I want to know wtf that means.
[17:08:36] New patchset: Dzahn; "class for mw cronjobs to run refreshLinks.php per cluster - fix var name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5104
[17:08:53] New patchset: Lcarr; "fixing spelling error" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5105
[17:09:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5104
[17:09:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5105
[17:09:17] No idea, sorry
[17:09:32] cool
[17:09:32] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5105
[17:09:34] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5105
[17:11:20] New review: Dzahn; "see inline comments" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/5104
[17:12:07] RECOVERY - Varnish traffic logger on cp1041 is OK: PROCS OK: 2 processes with command name varnishncsa
[17:14:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:16:16] !regsubst testing your regsubst replacings - https://blog.kumina.nl/2010/03/puppet-tipstricks-testing-your-regsubst-replacings-2/
[17:16:30] !regsubst is testing your regsubst replacings - https://blog.kumina.nl/2010/03/puppet-tipstricks-testing-your-regsubst-replacings-2/
[17:16:30] Key was added!
[17:18:05] LeslieCarr: this is the RT ticket http://rt.wikimedia.org/Ticket/Display.html?id=2824
[17:18:12] thanks
[17:19:17] New patchset: Pyoungmeister; ""flush" now depracted in udp2log" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5107
[17:19:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5107
[17:19:41] um, can spellcheck leslie fix that ? :)
[17:20:04] PROBLEM - Puppet freshness on es1004 is CRITICAL: Puppet has not run in the last 10 hours
[17:20:15] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5107
[17:20:17] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5107
[17:21:07] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[17:21:07] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[17:22:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 6.971 seconds
[17:22:46] RECOVERY - Varnish traffic logger on cp1042 is OK: PROCS OK: 2 processes with command name varnishncsa
[17:24:07] RECOVERY - Varnish traffic logger on cp1044 is OK: PROCS OK: 2 processes with command name varnishncsa
[17:25:01] RECOVERY - Varnish traffic logger on cp1043 is OK: PROCS OK: 2 processes with command name varnishncsa
[17:29:04] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[17:29:04] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[17:29:29] New review: Dzahn; "(no comment)" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/5104
[17:30:34] PROBLEM - LDAPS on nfs2 is CRITICAL: Connection refused
[17:31:01] PROBLEM - LDAP on nfs2 is CRITICAL: Connection refused
[17:32:04] RECOVERY - Varnish traffic logger on cp1022 is OK: PROCS OK: 2 processes with command name varnishncsa
[17:32:13] RECOVERY - Varnish traffic logger on cp1023 is OK: PROCS OK: 2 processes with command name varnishncsa
[17:32:58] RECOVERY - Varnish traffic logger on cp1025 is OK: PROCS OK: 2 processes with command name varnishncsa
[17:32:58] RECOVERY - Varnish traffic logger on cp1026 is OK: PROCS OK: 2 processes with command name varnishncsa
[17:33:07] RECOVERY - Varnish traffic logger on cp1032 is OK: PROCS OK: 2 processes with command name varnishncsa
[17:33:34] that was me restarting logging daemon on those boxes
[17:33:37] tired of seeing the errors :)
[17:33:49] however, anyone know what's up with db1001, db1020 ?
[17:33:52] RECOVERY - Varnish traffic logger on cp1033 is OK: PROCS OK: 2 processes with command name varnishncsa
[17:34:19] RECOVERY - Varnish traffic logger on cp1034 is OK: PROCS OK: 2 processes with command name varnishncsa
[17:34:28] RECOVERY - Varnish traffic logger on cp1036 is OK: PROCS OK: 2 processes with command name varnishncsa
[17:37:04] LeslieCarr: know anything about the LDAP on nfs2 though?
[17:37:17] no clue
[17:37:18] i see it is opendj, but still running
[17:37:28] restarting sounds like a good first step to me
[17:38:30] !log restarting opendj on nfs2 because it refused connections
[17:38:33] Logged the message, Master
[17:41:09] !log LDAP on nfs2 warnings - opendj was _just_ started there when puppet was fixed with an unrelated issue
[17:41:11] Logged the message, Master
[17:48:53] Formey seems upset
[17:49:13] Can't login, getting Disconnected: No supported authentication methods available (server sent: publickey)
[17:49:18] Could someone bounce it please?
[17:50:28] Oh
[17:50:39] formey looks ok
[17:50:48] New patchset: Pyoungmeister; "not quite a revert." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5110
[17:51:04] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5110
[17:51:08] Seems possibly broken since the ldap change
[17:51:23] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5110
[17:51:26] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5110
[17:51:27] the one i just logged?
[17:51:30] Can you login with non root?
[17:51:50] It may be co-incidental
[17:51:53] that looked like it did not run for quite a while, but puppet started running again on nfs2
[17:51:59] hmm
[17:52:01] ok
[17:53:51] yea, "invalid users" :o
[17:55:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:59:22] Reedy: uhm, yea, the LDAP server on nfs2 starts "successfully", and still this looks like not a coincidence..
[17:59:39] looking
[17:59:48] If you stop it again, does it fix formey?
[17:59:59] i _just_ tried exactly that. and no
[18:00:31] haha
[18:01:00] Where's ryan?
[18:02:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 4.453 seconds
[18:06:56] mutante: it seems a wiki about the size of Commons will take 3-4 hours to run refreshlinks
[18:09:43] New patchset: Jgreen; "simplified thread-spawning code, fixed copyright statement" [operations/debs/wikimedia-search-qa] (master) - https://gerrit.wikimedia.org/r/5111
[18:10:38] New review: Jgreen; "(no comment)" [operations/debs/wikimedia-search-qa] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/5111
[18:10:40] Change merged: Jgreen; [operations/debs/wikimedia-search-qa] (master) - https://gerrit.wikimedia.org/r/5111
[18:13:58] !log upgrade of udp2log on nfs1/2 complete. should be operating normally now.
[18:14:00] Logged the message, notpeter
[18:28:56] Reedy: what specifically is broken?
[18:29:34] Ryan_Lane: non roots can't login
[18:30:14] where?
[18:30:31] on formey?
[18:30:56] Ryan_Lane: ya
[18:34:33] RECOVERY - LDAP on nfs2 is OK: TCP OK - 0.001 second response time on port 389
[18:34:51] RECOVERY - LDAPS on nfs2 is OK: TCP OK - 0.000 second response time on port 636
[18:35:09] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:36:18] Reedy: fixed
[18:36:20] ish
[18:36:37] heh, cheers
[18:37:03] !log manually added iptables nat rules on nfs2
[18:37:05] Logged the message, Master
[18:37:23] (the ones from the init script, but that did not execute them for some reason)
[18:37:29] mutante: you shouldn't do that
[18:37:36] well, I guess it's likely ok
[18:40:15] New patchset: Ryan Lane; "Setting the bind address properly" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5116
[18:40:31] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5116
[18:40:48] but looked like udp2log iptables rules removed those from the init script..
[18:41:08] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5116
[18:41:11] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5116
[18:41:46] aah
[18:42:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.649 seconds
[18:43:51] RECOVERY - LDAP on nfs1 is OK: TCP OK - 0.002 second response time on port 389
[18:44:36] RECOVERY - LDAPS on nfs1 is OK: TCP OK - 0.007 second response time on port 636
[19:04:52] New patchset: Dzahn; "decommission sq34" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5119
[19:05:08] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5119
[19:16:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:22:35] New patchset: Lcarr; "Revert "Add ACL for carriers and redirect support for carriers landing page"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5122
[19:22:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5122
[19:23:00] preilly: ^^
[19:23:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.028 seconds
[19:23:22] New review: Lcarr; "this was only a test." [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5122
[19:23:25] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5122
[19:23:51] LeslieCarr: okay cool
[19:24:00] good to merge and push it now ?
[19:24:14] LeslieCarr: please let me know once it is pushed
[19:24:18] LeslieCarr: it looks good
[19:39:12] New review: Demon; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/5033
[19:43:42] Change abandoned: preilly; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/4032
[19:56:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:03:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.802 seconds
[20:34:12] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours
[20:36:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:38:57] New patchset: Pyoungmeister; "incompatible (and useless) on nfs1/2" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5127
[20:39:13] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5119
[20:39:13] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5119
[20:39:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5127
[20:39:52] New patchset: Lcarr; "decom sq40" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5128
[20:40:09] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5127
[20:40:09] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5127
[20:40:09] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5128
[20:41:44] New patchset: Lcarr; "decom sq39 sq40 & sq46 per ticket http://rt.wikimedia.org/Ticket/Display.html?id=2581" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5128
[20:42:01] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5128
[20:42:15] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5128
[20:42:18] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5128
[20:43:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 1.513 seconds
[20:49:05] LeslieCarr: how can I purge a nagios check at this point?
[20:52:17] like of a decom server ?
[20:52:22] or just on whatever ?
[20:52:46] on whatever
[20:54:24] same way you would before, if it doesn't go away on its own after the last puppet run on spence, you can sort of hack it by removing the file from /etc/nagios/puppet_checks.d on spence and rerunning puppet to make it rebuild
[20:55:25] but no need to purge from db9?
[20:55:36] (I'm looking in there and not seeing, which is why I ask)
[21:05:45] no need to purge from db9
[21:07:00] cool! thanks!
[21:11:06] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[21:16:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:23:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 3.419 seconds
[21:24:46] hi maplebed
[21:25:05] hi drdee. I'm only half here, btw.
[21:25:17] do you know why emery was experiencing packetloss up to 18% but now it's going back to normal?
[21:25:21] oh okay
[21:25:26] I know nothing.
[21:25:28] :D
[21:26:16] :)
[21:30:55] notpeter: is Oxygen ready for filters?
[21:31:55] not yet. it doesn't have its packetloss graphing yet.
[21:32:03] (as cronmail is telling us every minute)
[21:32:20] also... the multicast relay and the udp2log instance want to listen on the same port
[21:32:23] that needs to be sorted
[21:32:28] but once it is, it'll be good to go
[21:32:29] so, soon
[21:32:36] notpeter: what's the multicast relay?
[21:32:53] is it the one that gets logs from remote colos?
[21:33:37] tbh, not 100% sure. mark set it up
[21:33:42] look at class misc::squid-logging::multicast-relay for deets
[21:34:00] it's either that or something to do with general multicast logging, in which case we can probably kill it.
[21:34:19] notpeter: thanks, i think we have reached the max nr. of filters on emery as we have experienced packetloss 3x in the last week
[21:35:17] drdee: okie dokie. it's at the front of my queue
[21:35:27] maplebed: I think that it's new.... not sure
[21:35:33] notpeter: awesome!
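The nagios purge hack described earlier (drop the stale check file from /etc/nagios/puppet_checks.d on spence, rerun puppet, no db9 cleanup needed) amounts to something like the following. This is a dry-run sketch: the check filename is hypothetical, and `echo` only previews the commands rather than running them.

```shell
# Dry-run sketch of the purge hack: remove the stale check definition on the
# nagios host and let a puppet run rebuild the config. The filename under
# puppet_checks.d is a made-up example; echo previews instead of executing.
host=spence
check_dir=/etc/nagios/puppet_checks.d
stale=sq34.cfg   # hypothetical: check file left over from a decommissioned host
echo "ssh $host 'rm -v $check_dir/$stale && puppetd --test'"
```

Dropping the echo (and picking the real filename) would perform the removal; `puppetd --test` is the 2012-era command for a one-off foreground puppet agent run.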
[21:36:56] RECOVERY - Packetloss_Average on emery is OK: OK: packet_loss_average is 3.6270171875
[21:40:54] oh i think the multicast relay takes the purge multicast signals, makes them unicast, then sends them to/from AMS back into multicast
[21:41:01] do not want to remove that if it's the piece i think it is
[21:45:37] I'm pretty sure we have a relay that does exactly that yes
[21:45:39] that's not supposed to run in multiple places though, is it?
[21:45:42] For HTCP purges
[21:45:50] yes, we certainly have a relay that does that.
[21:45:55] but is that the relay that's running on oxygen?
[21:46:10] I don't think it is, or if it is, I think it's a redundant copy that shouldn't be running.
[21:46:22] that's what we need to find out before killing it.
[21:47:06] tcpdump to the rescue?
[21:47:21] no can do; there's 30MB of traffic on that port.
[21:49:52] There must be some way you can see what it's sending and to whom
[21:56:22] New patchset: Catrope; "Fixes for l10nupdate" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3885
[21:56:39] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3885
[21:56:43] maplebed: about packetloss, my theory is that packetloss is triggered when data generates 'positives' on many different filters. if it was just the number of filters running that causes packetloss then we would see a linear increase of packetloss as new filters are deployed but that's not the case. Packet loss happens in bursts or bellshapes, does that make sense or am I missing something?
[21:57:25] New review: Catrope; "(no comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3885
[21:57:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:58:36] You'd have to get Tim's take on that. If the buffer for a filter hasn't been emptied by the time you got through all other filters, then maybe.
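For context on the packet_loss_average figures nagios keeps reporting: udp2log loss monitoring is, as far as I understand it, based on spotting gaps in sampled sequence numbers in the log stream. A toy illustration of that calculation, with entirely made-up sequence numbers (this is not the production monitoring script):

```shell
# Toy illustration (made-up data): estimate percentage loss from gaps in a
# stream of received sequence numbers. With first=1, last=10, and 7 numbers
# actually seen, 3 of the expected 10 packets are missing.
printf '%s\n' 1 2 3 5 6 9 10 |
  awk 'NR == 1 { first = $1 }
       { last = $1; n++ }
       END { expected = last - first + 1
             printf "%.1f%%\n", 100 * (expected - n) / expected }'
# → 30.0%
```

The same idea explains why loss tracks load rather than filter count alone: when the box falls behind, whole stretches of sequence numbers go missing at once, which shows up as the bursty, bell-shaped loss curves discussed above.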
[21:59:27] gotta go. bbiab.
[22:06:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.036 seconds
[22:25:17] maplebed: do you have a list of all wikis you think should be sharded?
[22:25:31] at the moment or in general?
[22:25:35] now: en and commons.
[22:25:41] in general: maybe also de and fr.
[22:26:08] * AaronSchulz is thinking maybe 2 years ahead
[22:26:38] Many containers are better than many objects. Run extrapolations: anything over 100k objects: shard it.
[22:26:39] ;)
[22:38:34] drdee, robla: so has anybody rolled back the filters we put in place last week yet?
[22:38:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:38:57] (also, drdee - it's probably best to monitor the host for 1d rather than 1h after installing new filters given the cyclic nature of log volume)
[22:40:24] http://ganglia.wikimedia.org/latest/?r=week&cs=&ce=&m=&c=Miscellaneous+pmtpa&h=emery.wikimedia.org&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS makes it look like it's going to keep happening until they go away.
[22:40:34] maplebed: not yet. looking back at the deploy log, I wonder if notpeter 's puppetization work may be at play
[22:40:48] I doubt it.
[22:40:54] w
[22:40:59] (always possible, but the config file looks right)
[22:41:06] paravoid: r
[22:41:08] heh
[22:41:14] flaky internet around here
[22:41:43] ah, I see....
[22:41:52] works for you?
[22:42:28] robla: April 11th I pushed out new filters for drdee
[22:42:42] yeah, I see that now
[22:43:03] paravoid: well enough.
[22:43:57] maplebed, robla: well the first packetloss happened approx 24 hours after the deployment of the filters
[22:44:01] http://twitter.com/DEVOPS_BORAT/statuses/119489376374886400
[22:44:04] heh.
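[Editor's note: the "run extrapolations: anything over 100k objects: shard it" rule above can be sketched as a small helper. The object counts and the compound growth model below are invented for illustration; only the 100k threshold comes from the conversation.]

```python
# Hedged sketch of the sharding rule discussed above. Counts are made up;
# the growth model is a placeholder, not anything the ops team actually used.
SHARD_THRESHOLD = 100_000

def wikis_to_shard(object_counts: dict, years_ahead: float,
                   growth_rate: float = 0.5) -> list:
    """Return wikis whose extrapolated object count exceeds the threshold,
    assuming simple compound growth of growth_rate per year."""
    factor = (1 + growth_rate) ** years_ahead
    return sorted(w for w, n in object_counts.items()
                  if n * factor > SHARD_THRESHOLD)

counts = {"enwiki": 900_000, "commonswiki": 1_200_000,
          "dewiki": 80_000, "frwiki": 60_000, "nlwiki": 20_000}
print(wikis_to_shard(counts, years_ahead=0))  # already over: commons and en
print(wikis_to_shard(counts, years_ahead=2))  # two years out: de and fr join
```

This matches the shape of the exchange: en and commons need sharding now, de and fr only once a couple of years of growth are factored in.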
[22:44:14] drdee: that also has happened before
[22:44:25] it's a known pattern with udp2log
[22:44:30] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 2.033 seconds
[22:44:43] drdee: given the packetloss graph (week view: http://ganglia.wikimedia.org/latest/graph.php?r=week&z=xlarge&c=Miscellaneous+pmtpa&h=emery.wikimedia.org&v=0.637890393701&m=packet_loss_average&jr=&js=&vl=%25 )
[22:44:48] robla: i wasn't aware, sorry
[22:44:49] it's pretty clear that it's part of a daily cycle.
[22:45:06] we deployed after the peak of the curve on the 11th; makes sense it wouldn't start for a while on the 12t.
[22:45:09] 12th.
[22:45:43] again, my apologies, the solution is to move some filters to oxygen and not to deploy new filters on emery.
[22:46:08] quite likely as a long term solution. short term, we have to undeploy
[22:46:10] so who wants to make the call on which filters to can?
[22:46:17] i'll make the call
[22:46:32] we can undeploy the three most recent wikipedia zero filters
[22:47:04] tfinc: fyi ^
[22:47:09] drdee: do you want to prep a puppet change in gerrit?
[22:47:24] sure, i'll just comment them if that's okay
[22:47:40] also I'm likely to drop off the net soon (battery's almost dead) - any opsen wanna take point for the undeploy? (aka push drdee's puppet change to emery)
[22:47:56] drdee: just make sure that amit & dan know
[22:48:01] drdee: s/comment/commit/ , presumably
[22:48:15] robla: thanks for the heads up
[22:48:16] oh...comment them out
[22:48:21] robla; yes
[22:48:28] and push :)
[22:48:36] it's important that we stay on top of these dips in our collected traffic
[22:48:38] :)
[22:49:38] notpeter: are you still around to push drdee's change if I drop off? or LeslieCarr maybe?
[22:52:11] woosters: ^ can you figure someone out? this is pretty important that we do this today
[22:52:21] nagios is currently watching the packetloss metric; it's currently configured to notify IRC only.
[22:52:33] i'm around
[22:52:43] \o/
[22:53:18] LeslieCarr: nothing more complicated than https://gerrit.wikimedia.org/r/#change,4758
[22:53:26] New patchset: Diederik; "Undeploy three Wikipedia Zero filters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5137
[22:53:29] (that was the one from last week)
[22:53:42] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5137
[22:53:52] ... aaaand now I gotta bail.
[22:53:52] ;)
[22:53:56] i just added a comment so we can easily redeploy them on oxygen
[22:54:41] drdee: my opinion - that won't solve the problem, but uncommenting the necromancy stuff will.
[22:54:51] IIRC the pipe 1 vs. pipe 10 makes a world of difference.
[22:54:53] robla - will get Lesliecarr to look at it
[22:55:34] drdee: what you were talking about re: the amount of time for the filter to run - the necro filters must empty their pipe for every packet; the zero filters have a 10 packet break before they're called again.
[22:55:37] s/10/9/
[22:56:33] I wasn't involved in that filter :)
[22:56:46] but we can decrease the sampling rate
[22:57:31] drdee: hey
[22:57:41] was on phone, now context switching back :)
[22:57:58] hey
[22:58:18] need me to check this out right now or does in 30 minutes work ?
[22:58:33] hold on
[22:58:52] let me amend this commit
[22:59:33] LeslieCarr: I think 30 min works as long as it gets done before tomorrow morning in Europe
[22:59:46] (sunrise not midnight)
[22:59:47] ok, i'll be back in a minute then
[23:01:17] New patchset: Diederik; "Undeploy three Wikipedia Zero filters Reduce sampling rate to 1 in 10." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5137
[23:01:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5137
[23:01:46] I amended the commit, and reduced the sampling rate for Ryan's filters as well
[23:11:21] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5137
[23:11:24] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5137
[23:11:27] drdee: pushing this to emery and locke now
[23:11:37] locke?
[23:11:42] locke is not involved
[23:11:45] just emery
[23:17:00] LeslieCarr: i just got message that I can disable two more filters so I'll send another commit, that way we are sure that emery will be fine
[23:17:27] New patchset: Diederik; "Disabling E3 filters" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5140
[23:17:43] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/5140
[23:18:26] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/5140
[23:18:29] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/5140
[23:18:34] hrm, faulkner experimenting with necromancy ….
[23:18:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:20:41] LeslieCarr: okay if I leave?
[23:20:57] i'm updating emery and locke right now ...
[23:21:00] this should not create issues
[23:21:03] ok
[23:21:03] hehe
[23:21:11] we are undeploying after all :)
[23:21:12] is that a challenge? ;)
[23:21:19] yeah should be fine
[23:21:23] no no :)
[23:21:36] thanks!
[23:21:47] <^demon|away> this should not cause issues
[23:21:48] <^demon|away> Don't say that!
[23:21:50] drdee: can you check in on the packet loss first thing tomorrow?
[23:22:16] robla: yes, but we disabled 5 filters so that should really make a difference
[23:22:25] drdee: the right 5?
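[Editor's note: the "pipe 1 vs. pipe 10" remark refers to udp2log's per-filter sampling factor: the number after `pipe` means the filter is fed 1 in N incoming packets, so `pipe 1` filters must keep up with the full stream. A hedged sketch of what change 5137 amounts to in a udp2log filter file — the filter paths and log names here are illustrative, not the actual emery config:]

```
# udp2log filter config sketch (illustrative, not the real emery file).

# Before: unsampled -- the filter's pipe must be drained for every packet,
# which is what made the unsampled filters expensive under peak load.
# pipe 1 /usr/local/bin/zero-filter >> /var/log/squid/zero.log

# After: 1-in-10 sampling, per the amended change (5137).
pipe 10 /usr/local/bin/zero-filter >> /var/log/squid/zero.log

# Undeployed filters are simply commented out, so they can be
# redeployed on oxygen later.
```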
[23:22:26] :)
[23:22:36] PROBLEM - Puppet freshness on gilman is CRITICAL: Puppet has not run in the last 10 hours
[23:22:41] robla: i guess we will see that tomorrow ;)
[23:22:50] no i am sure the right ones.
[23:23:06] drdee: is there anything that is still going to be running that wasn't running April 10?
[23:23:13] no
[23:23:28] ok...that's pretty likely to fix it then
[23:23:59] there's usually a single culprit
[23:24:46] or at least, things frequently skew 95%, 2%, 1%, 1%, 1%
[23:25:02] robla: yeah i think the two filters without sampling might have been the culprits
[23:25:45] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 335 bytes in 0.019 seconds
[23:46:25] PROBLEM - Host amslvs4 is DOWN: PING CRITICAL - Packet loss = 100%
[23:47:55] PROBLEM - BGP status on csw2-esams is CRITICAL: CRITICAL: host 91.198.174.244, sessions up: 3, down: 1, shutdown: 0; Peering with AS64600 not established
[23:59:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds