[00:00:17] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42205 [00:00:35] binasher: None of the 3 new databases I added are there on db64 [00:00:40] Looks like it's not replicating them.. [00:00:46] New patchset: Asher; "pull db64 from s3" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42207 [00:00:49] db64 is totally broken [00:01:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.926 seconds [00:01:03] strong words [00:01:07] i think it wasn't actually in production until 18:54:48 [00:01:28] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42207 [00:02:17] fail [00:02:44] !log taking down kuo, its power settings are freaking it out, will fix and bring back into servie [00:02:54] Logged the message, RobH [00:02:56] !log asher synchronized wmf-config/db.php 'pulling db64 from s3' [00:03:06] Logged the message, Master [00:03:37] https://gerrit.wikimedia.org/r/gitweb?p=operations/mediawiki-config.git;a=history;f=wmf-config/db.php;h=c35cad88eb45beba1ee4188e28d4a92863e4c76f;hb=HEAD [00:07:09] RECOVERY - MySQL Slave Running on db64 is OK: OK replication [00:07:45] RECOVERY - MySQL Replication Heartbeat on db64 is OK: OK replication delay seconds [00:10:00] PROBLEM - mysqld processes on db64 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [00:10:27] RECOVERY - MySQL Slave Delay on es1002 is OK: OK replication delay NULL seconds [00:10:37] RECOVERY - MySQL Slave Delay on es1004 is OK: OK replication delay NULL seconds [00:11:41] !log rebuilding db64 via hotbackup of db1010 [00:11:49] Logged the message, Master [00:12:06] Reedy: yeah, that's why i was wondering if db.php on disk on fenari was out of sync with git [00:12:57] !log converting es1001 to innodb [00:13:05] Logged the message, notpeter [00:13:55] Mmmm [00:14:03] PROBLEM - mysqld processes on es1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [00:14:06] I'm guessing it must've been [00:14:21] huh, tim killed queries on it on 12/27 according to the server admin log [00:14:30] PROBLEM - Host kuo is DOWN: PING CRITICAL - Packet loss = 100% [00:14:51] and it looks like replication on it broke at Dec 27 08:53 [00:15:03] so maybe it was sorta pulled then [00:15:18] doesn't explain ptwikivoyage though [00:15:51] !log modifying kuo bios power profile from DAPC to OS performance per watt setting in attempt to resolve runaway process issue [00:16:00] Logged the message, RobH [00:19:00] RECOVERY - Host kuo is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [00:19:03] binasher: https://rt.wikimedia.org/Ticket/Display.html?id=4105 [00:19:57] !log starting innobackupex from es1002 to es1001 [00:20:07] Logged the message, notpeter [00:35:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:37:36] !log bsitu Finished syncing Wikimedia installation... 
: Update Echo to Master [00:37:44] Logged the message, Master [00:43:32] New patchset: Pyoungmeister; "swapping db61 and db62 to coredb stuffs for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42208 [00:45:40] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42208 [00:50:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds [01:08:03] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [01:08:39] New patchset: Ryan Lane; "Only define the directories if they aren't already" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42209 [01:09:10] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42209 [01:11:39] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 291 seconds [01:12:03] New patchset: Ryan Lane; "Fix deploy state directory requirement" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42210 [01:16:01] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [01:16:01] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [01:16:01] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [01:16:54] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 23 seconds [01:17:47] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42210 [01:23:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:33:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.414 seconds [01:38:58] Change abandoned: Ryan Lane; "Handled by a newer change" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39201 [01:40:07] New patchset: Ryan Lane; "Sort all hashes used in templates" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42212 [01:41:35] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42212 [01:53:44] New patchset: Andrew Bogott; "Work around a bug with TLS in opendj." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42213 [01:56:25] andrewbogott: one inline comment [01:56:53] I just made the privs match the files as created by the .deb [01:57:01] yeah [01:57:07] All the files in that dir are -rw-r--r-- [01:57:16] yep [01:57:26] but we like to make files owned by puppet read only [01:57:36] Makes sense. [01:58:53] New patchset: Andrew Bogott; "Work around a bug with TLS in opendj." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42213 [02:01:13] New patchset: Ryan Lane; "Fix template location" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42214 [02:01:55] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42214 [02:04:31] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42213 [02:04:37] andrewbogott: ^^ [02:04:43] thanks [02:07:09] yw [02:07:59] Ryan_Lane: There's another bigger, less pressing patch awaiting your review. [02:08:07] oh? [02:08:21] https://gerrit.wikimedia.org/r/#/c/40344/ ? [02:08:22] https://gerrit.wikimedia.org/r/#/c/40344/ [02:08:25] * Ryan_Lane nods [02:08:28] Yeah, from a while ago. 
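
For context on the db64 rebuild and the es1001 InnoDB conversion logged earlier this hour ("rebuilding db64 via hotbackup of db1010", "starting innobackupex from es1002 to es1001"), that kind of hot backup typically boils down to streaming a Percona XtraBackup run from the donor to the host being rebuilt. A minimal sketch; the backup user, password, and datadir paths are assumptions, not values from this log:

    # on the donor (e.g. db1010), stream a hot backup to the host being rebuilt
    innobackupex --user=backup --password=XXXX --stream=tar /tmp \
        | ssh db64 'tar -xif - -C /srv/sqldata.new'
    # on the destination, replay the InnoDB log before pointing mysqld at the new datadir
    innobackupex --apply-log /srv/sqldata.new
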
[02:08:42] It's too big to actually read :( [02:08:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:10:18] if we're really lucky I haven't edited them manually on a bunch of systems :D [02:10:25] I doubt that, though [02:10:37] andrewbogott: I +1'd it [02:10:45] when you're ready to merge it in and such, go for it [02:10:51] ok, thanks. [02:11:01] Yeah, I worry that they'll clobber all kinds of local customizations here and there :( [02:12:14] I almost always moved my changes into svn [02:12:19] unless it was a total hack [02:12:32] then I fixed the hack and moved it into svn [02:12:46] labs-nfs1 had some hacks that never got fixed [02:12:49] same with labstore2 [02:12:57] the only other system that uses the scripts is formey [02:13:03] and people rarely use them [02:14:07] So we're probably pretty safe -- those files were duped from labstore2, and labs-nfs1 is mostly obsolete [02:14:27] * Ryan_Lane nods [02:14:30] agreed [02:14:53] the only script we really use on formey is the one to modify groups [02:15:01] so I doubt anything bad will happen [02:15:33] PROBLEM - MySQL Replication Heartbeat on db1010 is CRITICAL: CRIT replication delay 254 seconds [02:15:42] PROBLEM - MySQL Slave Delay on db1010 is CRITICAL: CRIT replication delay 262 seconds [02:19:09] RECOVERY - MySQL Slave Delay on db1010 is OK: OK replication delay 0 seconds [02:19:18] RECOVERY - MySQL Replication Heartbeat on db1010 is OK: OK replication delay 1 seconds [02:20:47] !log truncated ( de | it )wikivoyage.searchindex tables [02:20:59] Logged the message, Master [02:22:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds [02:24:15] RECOVERY - mysqld processes on db64 is OK: PROCS OK: 1 process with command name mysqld [02:26:54] !log LocalisationUpdate completed (1.21wmf6) at Fri Jan 4 02:26:53 UTC 2013 [02:27:04] Logged the message, Master [02:27:46] !log restarted replication on db64 from db34 [02:27:57] Logged the message, Master [02:29:03] PROBLEM - MySQL Replication Heartbeat on db64 is CRITICAL: CRIT replication delay 1038 seconds [02:32:03] PROBLEM - MySQL Slave Delay on db64 is CRITICAL: CRIT replication delay 1050 seconds [02:37:03] Ryan_Lane, java.security isn't getting applied on virt0. What am I missing? [02:38:52] andrewbogott: in which way do you mean? [02:38:57] RECOVERY - MySQL Slave Delay on db64 is OK: OK replication delay 0 seconds [02:39:01] the file isn't? [02:39:06] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [02:39:06] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [02:39:10] applied or installed? [02:39:22] right… /etc/java-6-openjdk/security/java.security [02:39:29] the file there is still the original one, not the one from puppet [02:39:32] ah [02:39:40] Not sure if I mean 'applied' or 'installed' [02:39:40] heh [02:39:43] RECOVERY - MySQL Replication Heartbeat on db64 is OK: OK replication delay 0 seconds [02:39:43] 'installed' I guess [02:39:45] you didn't merge it in ;) [02:39:57] I just did so for you [02:40:05] oh, on sockpuppet? [02:40:19] yep [02:40:23] Weird, I did rebase on a 2nd machine and saw the change. I wonder what... [02:41:42] Oh, of course, my local system is pulling from gerrit not from sockpuppet. 
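
The "restarted replication on db64 from db34" step above is the usual CHANGE MASTER dance; a minimal sketch, where the replication user, password, and binlog coordinates are placeholders rather than values from the log:

    mysql -e "CHANGE MASTER TO MASTER_HOST='db34', MASTER_USER='repl', MASTER_PASSWORD='XXXX',
              MASTER_LOG_FILE='db34-bin.000001', MASTER_LOG_POS=4; START SLAVE;"
    # confirm both slave threads are running and watch the delay drain
    mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'
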
[02:41:54] Sheesh, definitely time to call it a day [02:41:59] :D [02:45:39] New patchset: Asher; "Revert "pull db64 from s3"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42215 [02:46:03] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42215 [02:46:09] RECOVERY - Puppet freshness on neon is OK: puppet ran at Fri Jan 4 02:46:06 UTC 2013 [02:47:33] !log asher synchronized wmf-config/db.php 'returning db64 to s3' [02:47:42] Logged the message, Master [02:49:15] !log LocalisationUpdate completed (1.21wmf7) at Fri Jan 4 02:49:14 UTC 2013 [02:49:23] Logged the message, Master [02:49:44] Well, I still can't log in but the traffic looks right at least. [02:49:48] More of this tomorrow I guess [02:59:39] Question about the creations of our three new wikis. How do you change the interwiki map of a wiki? e.g. switch from Wikipedia standard interwiki map to the wikisource one? [03:09:39] Dereckson: There's only one interwiki map for all Wikimedia wikis, as far as I know. [03:09:47] It can be read via the API or via Special:Interwiki. [03:11:36] 21:02:56 < Busy_importing> Dereckson: It seems to me that as.ws has the wrong interwiki configuration? (wikipedia) http://as.wikisource.org/wiki/Special:Interwiki [03:13:04] (busy_importing being MF-Warburg) [03:15:51] Dereckson: I don't see an issue. [03:44:57] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 187 seconds [03:45:24] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 204 seconds [03:50:12] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [03:50:40] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [04:01:00] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [04:01:01] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [04:01:01] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [04:01:01] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [04:01:01] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [04:01:01] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [04:32:21] RECOVERY - swift-account-reaper on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [04:37:26] PROBLEM - swift-account-reaper on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [04:37:45] PROBLEM - mysqld processes on db62 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [05:08:20] PROBLEM - Puppet freshness on solr2 is CRITICAL: Puppet has not run in the last 10 hours [05:10:26] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [05:16:18] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [05:20:20] PROBLEM - Puppet freshness on solr3 is CRITICAL: Puppet has not run in the last 10 hours [05:20:21] PROBLEM - Puppet freshness on solr1003 is CRITICAL: Puppet has not run in the last 10 hours [05:21:23] PROBLEM - Puppet freshness on solr1001 is CRITICAL: Puppet has not run in the last 10 hours [05:25:26] PROBLEM - Full LVS Snapshot on db66 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
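
The depool/repool cycle for db64 above is driven entirely from mediawiki-config: edit wmf-config/db.php, merge in gerrit, then push the file out with sync-file, which is what produces the "!log asher synchronized wmf-config/db.php ..." entries seen here. Roughly, on the deployment host (exact steps are a sketch, not a transcript):

    cd /home/wikipedia/common
    git pull                                  # pick up the merged revert from gerrit
    sync-file wmf-config/db.php 'returning db64 to s3'
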
[05:25:27] PROBLEM - mysqld processes on db66 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:25:27] PROBLEM - MySQL Idle Transactions on db66 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:25:27] PROBLEM - MySQL Slave Running on db66 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:27:05] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 8.743 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [05:27:06] RECOVERY - MySQL Idle Transactions on db66 is OK: OK longest blocking idle transaction sleeps for 0 seconds [05:27:06] RECOVERY - MySQL Slave Running on db66 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [05:27:06] RECOVERY - mysqld processes on db66 is OK: PROCS OK: 1 process with command name mysqld [05:27:06] RECOVERY - Full LVS Snapshot on db66 is OK: OK no full LVM snapshot volumes [05:51:51] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [05:53:30] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 7.230 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [05:59:04] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [06:13:18] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 5.047 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [07:14:57] New patchset: Dereckson; "(bug 43617) FlaggedRevs configuration for pl.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42217 [07:42:04] what? :o [07:43:56] ah [07:47:56] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [08:18:41] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [08:20:11] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [08:25:08] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [08:34:53] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [08:42:41] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.057 second response time [08:53:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:55:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.173 seconds [09:20:56] PROBLEM - Puppet freshness on silver is CRITICAL: Puppet has not run in the last 10 hours [09:20:56] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [09:29:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:43:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.043 seconds [10:15:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:25:22] New patchset: Aude; "(bug 43630) set autoconfirm count to 10 @ fawiki, per consensus" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42232 [10:28:08] RECOVERY - swift-account-reaper on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [10:29:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.062 seconds [10:33:24] PROBLEM - swift-account-reaper on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args 
^/usr/bin/python /usr/bin/swift-account-reaper [10:58:44] PROBLEM - MySQL Slave Delay on db1005 is CRITICAL: CRIT replication delay 192 seconds [10:59:29] PROBLEM - MySQL Replication Heartbeat on db1005 is CRITICAL: CRIT replication delay 204 seconds [11:02:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:08:57] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [11:13:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.360 seconds [11:16:53] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [11:16:54] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [11:16:54] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [11:47:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:03:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.045 seconds [12:32:30] RECOVERY - MySQL Replication Heartbeat on db1005 is OK: OK replication delay 0 seconds [12:33:14] RECOVERY - MySQL Slave Delay on db1005 is OK: OK replication delay 0 seconds [12:35:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:39:50] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [12:39:50] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [12:45:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.474 seconds [13:20:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:35:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.207 seconds [14:01:53] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [14:01:54] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [14:01:54] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [14:01:54] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [14:01:54] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [14:01:54] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [14:06:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:23:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [14:23:48] Can someone please run: chown mwdeploy /home/wikipedia/common/wmf-config/ExtensionMessages-1.21wmf7.php [14:25:38] !log demon Started syncing Wikimedia installation... : [14:25:48] Logged the message, Master [14:55:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:07:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.045 seconds [15:08:39] !log demon Finished syncing Wikimedia installation... : [15:08:48] Logged the message, Master [15:08:53] <^demon> Yay, about time. 
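
Reedy's chown request above is a one-liner on the deployment host; the follow-up check in this sketch is an assumption for illustration, not something run in the log:

    sudo chown mwdeploy /home/wikipedia/common/wmf-config/ExtensionMessages-1.21wmf7.php
    # optional: list any other wmf-config files not owned by mwdeploy
    find /home/wikipedia/common/wmf-config -maxdepth 1 ! -user mwdeploy -ls
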
[15:09:50] PROBLEM - Puppet freshness on solr2 is CRITICAL: Puppet has not run in the last 10 hours
[15:09:57] :)
[15:11:56] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours
[15:15:40] So, next step is to push wikidatawiki back to wmf7?
[15:17:27] aude: ^
[15:17:32] -tech got busy ;)
[15:18:03] brb, just gotta get a cable
[15:19:26] yea
[15:19:28] yes
[15:21:50] PROBLEM - Puppet freshness on solr1003 is CRITICAL: Puppet has not run in the last 10 hours
[15:21:50] PROBLEM - Puppet freshness on solr3 is CRITICAL: Puppet has not run in the last 10 hours
[15:22:53] PROBLEM - Puppet freshness on solr1001 is CRITICAL: Puppet has not run in the last 10 hours
[15:28:07] ok
[15:28:59] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikidatawiki to 1.21wmf7
[15:29:05] yay!
[15:29:07] Logged the message, Master
[15:29:23] Everything not broken? ;)
[15:29:26] looks good
[15:29:35] * aude tries making an item
[15:29:59] Have the bots been on holiday?
[15:30:07] ignore that
[15:33:10] works for me :)
[15:33:47] Yay
[15:34:34] thanks Reedy and ^demon for helping with this
[15:34:44] Chad did most of it
[15:34:45] :)
[15:36:15] now we have a way to kick the memcached for the sites stuff, so hope we don't see such bugs again
[15:40:00] Whaaaa
[15:40:01] -rw-r--r-- 1 dzahn wikidev 2642 Oct 18 02:32 index.php
[15:41:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:42:33] New patchset: Reedy; "Split dblists into their own subheading" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42269
[15:45:34] New patchset: Reedy; "wikidatawiki to 1.21wmf7" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42270
[15:45:52] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42270
[15:46:37] New patchset: Reedy; "Split dblists into their own subheading" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42269
[15:47:48] New patchset: Reedy; "Split dblists into their own subheading" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42269
[15:53:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.047 seconds
[16:03:48] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused
[16:27:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:39:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.040 seconds
[16:57:48] PROBLEM - MySQL Replication Heartbeat on db66 is CRITICAL: CRIT replication delay 303 seconds
[16:58:42] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.003 second response time on port 11000
[17:01:25] RECOVERY - MySQL Replication Heartbeat on db66 is OK: OK replication delay 0 seconds
[17:12:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:28:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds
[17:49:21] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[17:55:10] New patchset: Dzahn; "planet: have one theme config per language" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42277
[17:57:34] reedy@fenari:/home/wikipedia/common/docroot/noc/conf$ ls -al | grep dzahn
[17:57:34] -rw-r--r-- 1 dzahn wikidev 9168 Oct 18 02:32 fc-list
[17:57:34] -rw-r--r-- 1 dzahn wikidev 2642 Oct 18 02:32 index.php
[17:57:42] mutante: ^ could you g+w those for me please?
[17:57:51] Reedy: looks legit. that dzahn guy is alright
[17:58:06] heh
[17:58:55] Reedy: done
[17:58:58] thanks
[17:59:04] no prob!
[17:59:18] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42269
[17:59:23] heh :)
[17:59:41] i am not even sure when that happened
[17:59:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:00:18] Probably when we were moving the docroots about
[18:00:20] https://noc.wikimedia.org/conf/
[18:00:23] ^ dblist is slightly nicer
[18:00:43] Anyone know of a better image (cylinder on a piece of paper?) for the dblists?
[18:02:06] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 182 seconds
[18:02:44] I wonder where those images even came from..
[18:02:50] Krinkle|detached: ^^ Did you add them?
[18:03:36] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 225 seconds
[18:14:07] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.041 seconds
[18:18:11] Reedy: http://commons.wikimedia.org/wiki/Category:Database_icons
[18:18:26] http://commons.wikimedia.org/wiki/File:Database-mysql.svg that?
[18:20:16] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42277
[18:23:29] New patchset: RobH; "adding tarin back into poolcounter pool" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42280
[18:24:30] well... pre-puppet that change would have synced out and crashed poolcounter service...
[18:24:33] yay for jenkins
[18:25:03] New patchset: RobH; "adding tarin back into poolcounter pool" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42280
[18:25:03] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[18:25:12] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds
[18:28:30] New review: RobH; "this isnt the self review you are looking for" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/42280
[18:28:44] New review: RobH; "this isnt the self review you are looking for" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/42280
[18:28:44] Change merged: RobH; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42280
[18:29:21] !log robh synchronized wmf-config/PoolCounterSettings.php
[18:29:31] Logged the message, Master
[18:29:55] re-pool successful \o/
[18:33:38] New review: Dereckson; "shellpolicy" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/33390
[18:34:40] sbernardin1: https://rt.wikimedia.org/Ticket/Display.html?id=4032 the ticket for the intel SFP
[18:34:48] so if they are in, sign, scan, attach packign slip to ticket and resolve
[18:35:17] RobH: doing that now...
[18:35:32] cool, i hadnt assigned ticket for you
[18:35:35] so just lettin ya know
[18:36:16] cmjohnson1: just assigned the eqiad sfp ticket to you for same =]
[18:36:18] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours
[18:36:32] k
[18:37:12] <-- stat1: Failed to parse template redis/redis.conf.erb: Could not find value for 'redis_replication'
[18:41:45] cmjohnson1: Ok, the labsdb saga
[18:41:52] I'll recap what we have so we're on same page.
[18:41:56] k [18:42:00] (I was able to poke asher and ryan about it since they sit by me) [18:42:14] in Tampa we have labsdb1-3, ciscos, with the sandisk ssds installed [18:42:24] ACKNOWLEDGEMENT - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours daniel_zahn known issue with redis template [18:42:28] in Ashburn we have labsdb1001-1002, ciscos, with sandisk ssds installed [18:42:44] we are goign to take another virt100# server for labsdb1003 [18:42:58] however, i doubt you have the spare sandisks for this, right? [18:43:21] right..i have 2 spare [18:43:34] can you gimme the model info? i'll make a ticket to order more [18:43:39] and we need 8 per cisco iirc [18:43:52] New review: Faidon; "Thanks for this!" [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/41976 [18:44:08] i will get it to you ...do you want a ticket...they're in the flex space [18:45:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:45:58] cmjohnson1: https://rt.wikimedia.org/Ticket/Display.html?id=4256 [18:46:02] just reply to that ticket with it [18:46:24] we'll be taking virt1008 and renaming it [18:46:30] so dropping tickets for those changes now as well [18:46:35] RobH, cmjohnson1: what would it take to replace SSDs on ms-be1005-1012 to 320s? [18:46:43] with 320s even [18:47:00] uhh, because they are 710s now? [18:47:52] no [18:47:54] robh: i don't think they are [18:48:01] 1001-1002 are temporarily 710s [18:48:06] the rest are X25-Ms [18:48:11] right^ [18:48:12] we dont have x25s in eqiad. [18:48:29] the equivalent you bought for here [18:48:34] indeed, 320s. [18:48:40] so those servers should already have 320s in them. [18:50:21] Model Family: Intel X18-M/X25-M/X25-V G2 SSDs [18:50:21] Device Model: INTEL SSDSA2M160G2GN [18:50:34] oddddd [18:50:36] how?!?! [18:50:39] we never ordered x25s. [18:51:04] cmjohnson1 will need to pop one and compare it to the 320s on site [18:51:08] cuz i think they are 320s... [18:51:10] they were the ssd's taken out of the c2100's [18:51:14] the x25 had a black ring [18:51:19] plastic ring on top [18:51:23] paravoid: okay to remove a disk from ms-be1010? [18:51:25] where the 320s were bare metal... [18:51:35] everything but 1001-1004 are okay to mess with [18:51:38] im interested to see what pulls out of there [18:51:39] ok [18:51:45] lemme know =] [18:52:04] thanks :) [18:52:36] robh, paravoid ...intel ssd 320 Series [18:52:42] huh [18:52:42] only tampa has the x.25 [18:52:52] that output was from ms-be1003 [18:53:00] could they share device models? [18:53:28] model ssdsa2bw160g3 [18:53:53] that's not what ms-be1003 had above [18:54:06] all the same ssds here [18:56:03] weird [18:58:32] cmjohnson1: could you do me a favor and check ms-be1005 too? [18:58:46] ms-be1010 is not provisioned, so I can't run hdparm [18:59:08] ms-be1005 is installed but safe to shut down [18:59:29] okay...do you want to shut it down first? [18:59:36] done [18:59:45] k [19:00:03] paravoid: they are [19:00:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.041 seconds [19:00:15] paravoid: my understanding is the 320 and the x25 are pretty much identical in operation [19:00:20] so them having the same chipset is not surprising [19:00:29] the smartmontools database has different model numbers for them [19:00:32] mutante: ahoy. ottomata mentioned you might be able to verify that /a/eventlogging on stat1 gets picked up by the tape backups -- can you? 
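
The SSD identification going on in this exchange (320 series vs. X25-M) relies on what smartctl and hdparm report for each drive; a minimal sketch, assuming the disk shows up as /dev/sda:

    sudo smartctl -i /dev/sda | grep -E 'Model Family|Device Model|Firmware Version'
    sudo hdparm -I /dev/sda | grep -i 'Model Number'
    # per the log, the 320 series reports models like SSDSA2BW160G3 ("G3"),
    # while the X25-M G2 reports SSDSA2M160G2GN
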
[19:00:41] awjr: you have the link? [19:00:49] AaronSchulz: yah, thanks :) [19:00:52] New patchset: Pyoungmeister; "coredb: es shards are now 100% inno." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42283 [19:01:13] ori-l: ok, lemme check on tridge [19:01:17] notpeter: \o/ [19:01:23] RobH: and in there it says "G3" for the 320, the label that cmjohnson1 also said G3, but hdparm/smartctl say G2 in my tests [19:01:27] mutante: thanks [19:01:31] paravoid/robh: so it's a different disk [19:01:43] mutante: it's a newish subdir, only been there for a week or two [19:01:43] oh? [19:01:54] are those X25-Ms? [19:01:55] Change abandoned: Nemo bis; "Dereckson, your comments don't make any sense." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33390 [19:02:04] AaronSchulz: that actually happened yesterday afternoon [19:02:05] hurray! [19:02:11] <^demon> paravoid: So yeah, I wasn't sure if ensure => absent would work on nfs anyway :) [19:02:13] in fact i checked i disks in 1011 and 1009 and the other disk in 1010 and they are all ssdsa2m1602gn [19:02:34] AaronSchulz: looks to me like purgeParserCache.php needs to be updated to support sharded bagOStuffz.. do you concur? [19:02:35] okay [19:02:43] so we do have X25-Ms in eqiad :) [19:02:45] the one disk i pulled on 1010 was the only 320 [19:02:51] bleh [19:03:09] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100% [19:03:32] cmjohnson1: So we have a mix of X25 and 320s? (this makes little sense, bleeeeh) [19:03:34] ori-l: no, that issue is still not solved.. it just backs up home [19:04:00] yeah...now I need to check 1005-1012 to see [19:04:20] ori-l: it's an existing ticket RT-3098 that ottomata has [19:04:28] reason is /a being too large for a tape volume [19:04:37] cmjohnson1: now that we verified we can trust hdparm's output, I can check 1005-1007 for you [19:04:38] let me chime in quickly [19:04:43] (they're installed) [19:04:44] mutante and ori-l [19:04:46] k [19:04:51] woosters: sooooooooo i dont htink i assigned parsoid servers in ashburn after all [19:04:51] mutante, I think we excluded a lot of /a directories and got things working [19:04:54] reason that /a is too large is that wikistats has a 32gb stuff [19:04:58] not sure if I remember correctly though [19:04:58] but i have some spares, so i'll get them spun up today. [19:05:17] ottomata: but unfortunately i don't see stuff on tridge in /data/amanda [19:05:21] robh - thks [19:05:26] ottomata: just these 00001.stat1.wikimedia.org._home.0 [19:05:34] robh: isn't wtp1001 parsoid? or in addition to that [19:05:38] i would expect something like 00001.stat1.wikimedia.org._a.0 or something [19:05:48] so no _a [19:05:48] ah [19:06:11] cmjohnson1: in addition [19:06:23] so basically the servers we have allocated and will allocate today are single cpu misc servers [19:06:26] cmjohnson1: 1005 was X25, 1006 too [19:06:41] so they will be replaced for parsoid use once the new dual cpu misc servers arrive (which quotes i am working on today) [19:06:43] can't ssh to 1007 [19:06:46] paravoid: all but the 1 disk on 1010 [19:06:49] wow [19:06:52] that's some luck [19:07:15] the fact that it was the first disk you pulled at random [19:08:15] ottomata: do you get that Amanda email ? [19:08:16] crazy eh? [19:08:16] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42283 [19:08:30] heh yeah [19:08:31] ottomata: because i cant login right now.. 
but that should give a summary [19:08:56] robh: so will the allocated server for parsoid be temporary since they are single cpu? [19:08:58] so I think we agreed with binasher that we should use 320s for those boxes [19:09:19] paravoid: i need to see how many 320 [19:09:19] in the list, like two or three days ago [19:09:22] s i have here [19:09:23] ..and if it still has the "dump larger than" ...issue.. then it's still that [19:09:33] RobH: unleash the intel 320 orders [19:09:45] binasher: right? [19:09:55] ah yeah, mutante, i get them [19:10:01] FAILED "[dump larger than available tape space, 518277807 KB, incremental dump also larger than tape]" [19:10:14] * cmjohnson1 checking storage for # of 320's [19:10:23] should we split up /a/? [19:10:50] * paravoid wonders what we're going to do with a bunch of known-to-lose-data SSDs :) [19:10:51] if that is not a big issue for you that sounds like a solution. [19:10:54] ori-l: [19:11:27] ottomata: is there another path on stat1 i could rsync the data to that _would_ get picked up by the backups? [19:11:47] ori-l: a quote from andrew "Diederik and Erik, does all of /a actually need to be backed up? I know that a lot of the stuff in there is just duplicate log data copied over from the udp2log servers" [19:12:06] mutante: i only care about /a/eventlogging [19:12:24] New patchset: Silke Meyer; "Changes order of settings for role_config to work" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42285 [19:12:43] but there is other stuff in /a/ that is important, just not anything i oversee [19:12:44] ori-l: then let's configure that separately in backups ... [19:12:55] mutante: ok! that would be awesome, thanks! [19:13:03] ori-l: the alternative would be configuring a larger virtual tape size.. but none of us did that before http://wiki.zmanda.com/index.php/How_To:Set_Up_Virtual_Tapes [19:13:11] well, mutante, we're already excluding /a/squid/archive [19:13:19] which is all the log data [19:13:30] binasher: I have 24 SSDs for varnish boxes for you! [19:13:32] :P [19:13:43] we are also excluding wikistats/backup and wikistats/tmp [19:13:58] uhh, dammit.lt is 419G? [19:14:03] 119M eventlogging/ [19:14:05] <-- small [19:14:08] paravoid: they're perfect for that, but too small these days [19:14:09] erosen is 14G [19:14:19] fundraising 2.6G [19:14:20] ottomata: heh, that would be domas [19:14:32] yeah, eventlogging is small, i don't want to bundle it with the rest of the stuff if it means that it fails to back up periodically [19:14:33] wikistats git, 255G [19:15:04] hmmmmmmmmmmmmmm [19:15:08] paravoid robh i have enough 320's for 6 ms-be (includes the lone 320 in ms-be1010) [19:15:09] mutante, how does amanda work with symlinks? [19:15:11] here's the change https://gerrit.wikimedia.org/r/#/c/20946/2 [19:15:18] that reduced it last time [19:16:36] ottomata: i think it will backup the link itself but not the data it points to.. but also just googled that [19:17:11] aye [19:17:13] cmjohnson1: can we fit them to 1005-1007? [19:17:20] there are 1.9T in /a [19:17:40] paravoid: yep [19:17:46] great [19:17:49] thanks :) [19:17:49] * cmjohnson1 creates a ticket  [19:17:56] cmjohnson1: don't provision [19:17:58] or if you do [19:18:07] create a raid0 between those two ssds [19:18:17] in the controller [19:18:26] okay [19:18:47] oh hm [19:18:51] wait [19:19:00] ... [19:19:00] I have to give asher the 710s back at some point [19:19:13] yeah...but robh will need to order more 320s [19:19:29] binasher: that okay with you? 
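
The per-directory sizes being quoted here (dammit.lt 419G, wikistats 255G, eventlogging 119M, about 1.9T in /a overall) are the usual du survey used to decide what can fit on a tape volume; a minimal sketch:

    du -sh /a/* 2>/dev/null | sort -rh   # largest subdirectories of /a first
    du -sh /a                            # overall total (roughly 1.9T at the time)
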
[19:19:37] i have a bunch for asher if he needs them sooner [19:19:39] waiting for the next batch of 320s to give you back the 4 710s? [19:19:45] aha, okay [19:20:06] New patchset: Dzahn; "backup /a/eventlogging on stat1 separately because /a fails being too large" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42286 [19:20:11] okay then :) [19:20:12] ottomata: ori-l ..there [19:20:17] thanks! [19:20:17] mutante: thank you! [19:20:26] yw [19:20:43] that's fine, i still to either get ciscos back from analytics before using the remaining 710's in dbs [19:20:51] or order new servers.. but i think i'm getting ciscos [19:21:42] so if i need to order 320s [19:21:46] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42286 [19:21:47] someone has to tell me how many and why =] [19:22:14] New review: Ori.livneh; "Thanks again, much appreciated!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42286 [19:22:18] cmjohnson1: you want the labsdb1003 install ticket assigned to you? [19:22:21] PROBLEM - Puppet freshness on silver is CRITICAL: Puppet has not run in the last 10 hours [19:22:22] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [19:22:25] i alrady made all the other related tickets yours ;] [19:22:26] yes please robh [19:22:39] all yours [19:22:47] ottomata: so.. let's check that tomorrow. i did not touch the existing /a/ backup, just added it.. and /a/ will still fail.. but at least eventlogging should be safe [19:22:54] well we need 320's for at least 6 more ms-be servers so that's 12 plus spare [19:23:00] that's cool [19:23:04] RobH: however many ceph needs + 6 for the three misc servers you were allocating for me yesterday + let's say 20 [19:23:12] ori-l, also, reminder, we have eventlogging in hdfs [19:23:17] so its replicated 3 times across 10 nodes [19:23:19] in hadoop [19:23:45] ceph is independent of msbe or same servers? [19:23:50] same [19:23:59] ok, so yea 12 and spares [19:24:01] paravoid: can confirm [19:24:04] 20 seems reasonable, i'll make a ticket. [19:24:04] yes [19:24:13] ottomata: okay, but that whole setup is full of new, and it needs to be in use for a while longer before we can trust it as a backup [19:24:23] we might want them for ms-be @ pmtpa at some point [19:24:25] btw.. backups listed here http://wikitech.wikimedia.org/view/Backup_procedures#Misc_data [19:24:35] but not yet [19:24:50] claims its just 400G..but i guess its way more .. even with the excludes [19:25:00] mutante: would it be easy to add me to the automated emails you mentioned earlier? [19:25:22] ori-l: hmm.. i dont know yet ..hold on [19:25:36] New patchset: Pyoungmeister; "fix scoping issue for solr role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42287 [19:26:37] update: goign to go with 10 160GB for msbe [19:26:50] and 10 300GB for the other db type servers (i have three i'll be dropping tickets for) [19:27:06] robh: i need 12 for msbe [19:27:22] oh, 12 with or without spares? [19:27:23] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42287 [19:27:23] without [19:27:23] ok, then 16. [19:27:48] New review: Nemo bis; "Also, please don't assign me any shell bug again because it's my opinion that the fixer is the shell..." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33390 [19:28:00] paravoid: let's get rid of the c2100's first before we go messing w/tampa [19:28:12] ;-] [19:28:49] PROBLEM - Packetloss_Average on locke is CRITICAL: XML parse error [19:29:14] ori-l: hmm.. not really that easy.. it just mails root@ and even if i added your address as well you would get mail for all other backups, not just this one [19:29:54] mutante: oh, ok. is it alright then if i follow-up with you in a few days' time to make sure the backups are indeed getting picked up? [19:29:58] it's a global config in amanda [19:30:00] RECOVERY - Puppet freshness on solr1001 is OK: puppet ran at Fri Jan 4 19:29:40 UTC 2013 [19:30:05] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42285 [19:30:10] ori-l: yes please [19:30:34] mutante: thanks again [19:30:39] np [19:31:34] ori-l, re hdfs eventlogging, totally agree :) [19:31:39] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.45676918519 [19:32:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:51] MaxSem: you there? [19:37:04] notpeter, yes sir [19:37:05] paravoid: robh: as bust most of the 320's open they are all 300GB not 160GB [19:37:32] MaxSem: ok, puppet is no longer borken on the solr boxes [19:37:43] (there was a scoping issue that I didn't catch when I looke dover your stuff) [19:37:47] want to deploy repl on monday? [19:38:05] notpeter, sure [19:38:08] cool! [19:38:25] should we get a deploy slot? or just do it whenevs? [19:38:51] because it involves MW config syncs, I'd rather reserve a window for it [19:39:26] cool! want to schedule? [19:39:26] cmjohnson1: wait, we have 320s? [19:39:44] cmjohnson1: I thought we had x25s there? [19:39:50] * RobH is confused as shit [19:40:14] yes...we have a few...enough for 5 maybe 6 servers. [19:40:18] we need 12 more [19:40:25] notpeter, after 1pm PST or before 11am? [19:40:41] hhhmmm, either is fine for me. what works better for you? [19:40:42] I don't care until 3pm [19:41:22] let's to morning, then [19:41:26] (PST) [19:41:26] Yossie: welcome [19:43:06] https://wikitech.wikimedia.org/index.php?title=Deployments&diff=55061&oldid=55060 reserved 2 hours just in case [19:44:03] FOR SCIENCE! http://www.livescience.com/25959-atoms-colder-than-absolute-zero.html [19:44:38] negative temperatures? wth [19:44:54] yossie - welcome [19:45:00] "new engines that could technically be more than 100 percent efficient" [19:45:08] wow.. perpettum mobile for realz?? [19:46:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds [19:46:34] Greg Bear wrote a book called Heads in which they use negative (<0K) temperatures.. It was curious, at the least, and featured in the much better book "Moving Mars" by same author. [19:50:28] MaxSem cool, sounds good [19:56:24] RECOVERY - Puppet freshness on solr1003 is OK: puppet ran at Fri Jan 4 19:56:02 UTC 2013 [19:57:00] RECOVERY - Puppet freshness on solr2 is OK: puppet ran at Fri Jan 4 19:56:33 UTC 2013 [19:57:01] RECOVERY - Puppet freshness on solr3 is OK: puppet ran at Fri Jan 4 19:56:37 UTC 2013 [19:57:10] RECOVERY - Puppet freshness on vanadium is OK: puppet ran at Fri Jan 4 19:56:56 UTC 2013 [19:58:38] Ryan_Lane: btw, ldap logins for my RT install are working now. 
Lots of questions about how to actually migrate to that though… probably I'll need to sit down with you and mutante next week and discuss. [19:59:08] heh [19:59:09] yeah [19:59:19] may need to rename user accounts in the database first [20:00:06] Yeah, I have no idea what happens when account names overlap. I see other folks online allowing that in their install but it's not clear what the behavior is. [20:00:20] andrewbogott: oh cool where is your install? [20:00:29] there are auto-created and "real" users though [20:01:13] auto-created have an email address format the others have nicknames [20:01:59] !log depooling ssl1001 [20:02:09] Logged the message, Master [20:06:55] mutante: abogott-request-tracker.pmtpa.wmflabs [20:07:22] andrewbogott: cool, is it 4.0.4 ? [20:07:35] we also have an upgrade ticket :) [20:07:47] no, 3.8.11 because I was imagining a gentle switch from the existing install. [20:08:00] I can only hope that the process for 4.0.4 is similar :/ [20:09:34] oh yea, of course it makes sense to test with a similar version.. but we can try the upgrade on labs later..that's also nice to have [20:09:35] New patchset: Pyoungmeister; "adding cmcmahon to admin.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42291 [20:10:19] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42291 [20:10:20] mutante: Getting rt up in labs from puppet was not especially difficult. The puppet stuff is a bit broken in that it installs apache and lighttp both and that caused me a great deal of confusion at first. [20:11:44] andrewbogott: i see.. seems we have too many ways to install Apache [20:12:12] It's just because of tangled dependencies in puppet… everything trickles down to Apache eventually [20:16:18] andrewbogott: i still use this class {'webserver::php5': ssl => 'true'; } btw [20:16:28] be back after lunch [20:17:11] mutante: Right, that installs Apache, but rt actually runs on lighttpd [20:18:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:28:23] New patchset: Asher; "adding no_master array to the db topology, to map to the mysql-master-ha "no_master=1" node config option. snapshot hosts plus hosts defined here will be excluded from consideration for master promotion." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42292 [20:29:09] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42292 [20:29:34] New patchset: Pyoungmeister; "coredb: include slow_query_digest in a not ridiculous way" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42293 [20:30:55] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42293 [20:31:29] mutante: I was meaning something to match https://noc.wikimedia.org/conf/images/document.png [20:32:36] we do get some interesting files in commons [20:32:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.050 seconds [20:34:14] New review: Demon; "Is this still necessary? If not, please abandon." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2141 [20:40:38] ^ Looks redundant to Tims work on fluorine... [20:43:39] <^demon> Reedy: That's what I thought, and at the very least would need to be rewritten. [20:43:45] <^demon> Figured I'd ask Roan though. 
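
On andrewbogott's confusion about the RT puppetization installing both apache and lighttpd: a quick, hedged way to see which daemon actually owns port 80 on such a box is to check the listener directly:

    sudo netstat -tlnp | grep ':80 '
    # or, equivalently:
    sudo lsof -iTCP:80 -sTCP:LISTEN
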
[20:52:40] !log db1038 removing bad hdd from slot6 for replacement [20:52:49] Logged the message, Master [21:00:24] log db1042 removing/replacing bad hdd from slot 9 [21:01:30] !log db1042 removing/replacing bad hdd from slot 9 [21:01:40] Logged the message, Master [21:04:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:10:22] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [21:14:34] PROBLEM - Host ms-be1007 is DOWN: PING CRITICAL - Packet loss = 100% [21:18:18] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [21:19:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.241 seconds [21:22:51] Ryan_Lane: where is all the git-deploy stuff? [21:23:16] http://wikitech.wikimedia.org/view/Git-deploy [21:23:20] !log restarting torrus-common on manutius [21:23:29] Logged the message, Master [21:23:34] * AaronSchulz looks at operations/debs/git-deploy [21:23:50] !log authdns-update for some mgmt name changes [21:23:59] Logged the message, RobH [21:25:02] !log mauntius/torrus: http://wikitech.wikimedia.org/view/Torrus#Deadlock_problem [21:25:12] Logged the message, Master [21:26:07] AaronSchulz: looks to me like purgeParserCache.php needs to be updated to support sharded bagOStuffz.. do you concur? [21:26:15] why? [21:26:50] it just calls deleteObjectsExpiringBefore() [21:27:07] that function was updated wasn't it? [21:28:20] New patchset: Dzahn; "add parser cache purging script as cronjob on hume, per RT-2108" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38275 [21:28:20] ah, yeah, amending one more patch set to let that cron job run once a way (weekday 0) [21:29:02] s/way/week [21:29:45] AaronSchulz: ah yes, it iterates thru the shards sequentially.. nevermind! [21:30:54] RobH: torrus fixed:) [21:32:34] RECOVERY - Host ms-be1007 is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms [21:36:00] PROBLEM - Host ms-be1006 is DOWN: PING CRITICAL - Packet loss = 100% [21:36:28] RECOVERY - SSH on ms-be1007 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:39:10] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38275 [21:41:52] RECOVERY - Host ms-be1006 is UP: PING OK - Packet loss = 0%, RTA = 26.66 ms [21:42:55] WTF ?!?! [21:42:58] neon started working on its own [21:42:59] puppet [21:43:09] LeslieCarr: :)) [21:43:22] did you do something to fix it ? [21:43:28] or is this seriously miraculous fixing [21:43:33] no, i guess it was just puppet having issues [21:43:38] yesterday [21:43:47] my brain as-plode [21:44:10] bad timing when puppetmaster was restarted and now something refreshed? shrug [21:44:51] LeslieCarr: i did nahsing [21:45:52] weird [21:45:54] so weird [21:46:21] PROBLEM - SSH on ms-be1006 is CRITICAL: Connection refused [21:47:24] and.. puppet in general is not that slow anymore [21:47:31] like on hume.. just 30 seconds [21:48:08] <^demon> It's not bad on the gerrit boxes either. [21:49:48] RECOVERY - SSH on ms-be1006 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:50:34] New patchset: Lcarr; "allowing spence to talk to neon nrpe + deleting unused template" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42350 [21:50:38] well this could be a cpu valley [21:50:39] ? [21:50:43] checking ganglia ... 
[21:50:52] RECOVERY - Host ms-be1005 is UP: PING WARNING - Packet loss = 44%, RTA = 26.86 ms [21:51:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:21] LeslieCarr: valley just ended [21:51:29] yeah [21:51:29] as nagios-wm just reported [21:51:31] hehe [21:54:45] PROBLEM - SSH on ms-be1005 is CRITICAL: Connection refused [21:56:50] Ryan_Lane: http://www.gossamer-threads.com/lists/wiki/wikitech/311155?do=post_view_threaded#311155 [21:58:39] New patchset: Nemo bis; "(bug 29692) Per-wiki namespace aliases shouldn't override (remove) global ones" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25737 [22:00:00] RECOVERY - SSH on ms-be1005 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:01:39] PROBLEM - Host ms-be1006 is DOWN: PING CRITICAL - Packet loss = 100% [22:06:09] New patchset: Pyoungmeister; "coredb: addressing researchdb case" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42352 [22:07:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.045 seconds [22:07:24] binasher: set of eyes on that ^^ [22:07:36] sure [22:09:47] RECOVERY - Host ms-be1006 is UP: PING OK - Packet loss = 0%, RTA = 26.77 ms [22:11:06] I'm just not that familiar with the ins and outs of the reasearch dbs. I think that that's all that's special about them.... [22:11:17] but you might know things that I don't about this [22:13:14] RECOVERY - Puppet freshness on ms-be1007 is OK: puppet ran at Fri Jan 4 22:13:04 UTC 2013 [22:14:03] notpeter: all of the researchdbs are currently in s1, but that might change at some point this year [22:15:49] Ryan_Lane: "#176: Change MySQL on tesla vms to use the wmf version" :) [22:16:00] "We shouldn't bother doing this on tesla, but should instead do this on [22:16:03] the virt cluster as part of the default project." [22:17:48] !log updated asw-a-eqiad vlan for 4/0/44 [22:17:57] Logged the message, RobH [22:20:58] binasher: yeah, i thought that might happen at some point, but I'd rather not pre-optimize. legit? [22:22:50] can I elect that future-notpeter will write that code? [22:22:55] notpeter: i also don't really like having a class for them, that just calls another class [22:23:04] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42350 [22:23:36] so you'd rather have, like, a researchdb flag, or something? [22:24:02] and an array, like master/snapshots? [22:24:07] meh, hate all of this [22:24:11] :( [22:24:17] hmm, keep the class [22:24:20] RECOVERY - Puppet freshness on ms-be1006 is OK: puppet ran at Fri Jan 4 22:24:14 UTC 2013 [22:26:04] yay edgecases! [22:27:43] New patchset: Pyoungmeister; "coredb: addressing researchdb case" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42352 [22:29:09] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42352 [22:29:44] RECOVERY - Puppet freshness on ms-be1005 is OK: puppet ran at Fri Jan 4 22:29:16 UTC 2013 [22:33:20] RECOVERY - NTP on ms-be1007 is OK: NTP OK: Offset -0.01997196674 secs [22:33:30] RECOVERY - mysqld processes on es1001 is OK: PROCS OK: 1 process with command name mysqld [22:33:47] closing "#352: detect outdated phase3 images in prod" :) [22:35:31] !log es1001 back in prodcution. all external store are innodb! 
[22:35:39] Logged the message, notpeter [22:36:58] notpeter: i'm running stop slave ; reset master ; reset slave; on es100[2-4] [22:37:01] cmjohnson1: oh wow, thanks! [22:37:24] binasher: ok [22:38:11] paravoid: looks correct to me ...plz chk and let me know [22:38:24] Reedy: do we have sitemaps on commons? [22:39:04] i mean.. a cron that runs a sitemap generation script [22:39:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:39:25] but it also said in 2011 "In the near future we will want sitemaps to be generated in a more clever way, especially since Commons content is exploding, ..." ...shrug [22:40:15] I think we did... [22:40:30] mutante: maintenance/generateSitemap.php [22:41:09] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [22:41:09] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [22:41:43] Ryan_Lane: http://wikitech.wikimedia.org/view/Git-deploy#Timeline_for_slots [22:41:51] so can we just do it the second way then? [22:42:00] you can do it however you'd like :) [22:42:13] those are just ideas [22:42:57] so I can delete the first one? :) [22:43:05] yep [22:43:30] whatever you guys think is sanest for managing slots [22:44:17] RECOVERY - NTP on ms-be1006 is OK: NTP OK: Offset -0.01913559437 secs [22:46:08] !log temporarily setting up mysql replication on shard es1 again (es([1-4]|100[1-4]) for mha testing [22:46:17] Logged the message, Master [22:48:30] Reedy: thanks, just looking at ancient tickets to get rid of:) [22:48:48] binasher: fwiw, I'd love to learn more about MHA when you're done :-) [22:49:24] RECOVERY - NTP on ms-be1005 is OK: NTP OK: Offset -0.02033925056 secs [22:49:35] paravoid: if/when mha is done, everyone will have to learn about it. but enthusiasm is good! [22:51:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.050 seconds [22:56:27] binasher: like how all of us know how to do a master switchover now? :P [22:56:57] but optimism is good too :-) [22:59:00] paravoid: https://wikitech.wikimedia.org/view/Master_switch [22:59:15] "This Connection is Untrusted" [22:59:24] yeah I've seen it [22:59:31] stop trying to hack me! [22:59:36] ;) [23:00:27] paravoid: so if you don't know how to do anything outlined in it, ask notpeter [23:00:37] * binasher ducks  [23:00:41] binasher: honestly, I think I'd just call you and wake you up before attempting to do that [23:01:00] nono, wake up notpeter [23:01:05] hahaha [23:02:07] but jokes aside, if you're feeling up to it I'd love to hear more about MHA, not just point-by-point instructions on how to do a switchover [23:02:16] what do you like, what you don't like etc. [23:02:47] might be too much to ask though :) [23:09:44] paravoid: i'm not certain we'll actually use mha, or that if we do, it will be for the long term [23:16:55] paravoid: longer term, i currently like the idea of all db's within a shard in a single datacenter using synchronous multi-master replication via galera, with async standard mysql replication across colos to another multi-master galera cluster. mysql failover would only exist within the context of datacenter failover [23:24:19] Ryan_Lane: is most of the git deploy code in puppet? 
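
binasher's "stop slave ; reset master ; reset slave" on es100[2-4] and the temporary es1 replication for MHA testing look roughly like the following; the MHA config path is an assumption:

    for h in es1002 es1003 es1004; do
        mysql -h "$h" -e "STOP SLAVE; RESET MASTER; RESET SLAVE;"
    done
    # once replication is re-established, MHA's own sanity check:
    masterha_check_repl --conf=/etc/mha/es1.cnf
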
[23:25:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:26:28] not seeing much in the repo [23:29:02] !log authdns-update for xenon and caesium dns [23:29:12] Logged the message, RobH [23:29:29] AaronSchulz: ryan went away from computer for awhile, ran to shop down the block [23:29:42] (didnt want you to sit there waiting and wondering if he was going to answer you ;) [23:29:44] "shop"? you mean booze? [23:29:52] yes. [23:29:54] heh [23:30:13] friday \o/ [23:32:02] It's friday, friday, friday, gotta get down on friday [23:33:47] PROBLEM - Auth DNS on ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [23:36:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.284 seconds [23:39:02] RECOVERY - Auth DNS on ns1.wikimedia.org is OK: DNS OK: 0.052 seconds response time. www.wikipedia.org returns 208.80.154.225 [23:39:50] New patchset: Pyoungmeister; "testing: set db62 as researchdb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42356 [23:43:37] New patchset: Pyoungmeister; "testing: set db62 as researchdb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42356 [23:44:52] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42356 [23:48:58] PROBLEM - Host db62 is DOWN: PING CRITICAL - Packet loss = 100% [23:54:49] RECOVERY - Host db62 is UP: PING OK - Packet loss = 0%, RTA = 1.50 ms [23:59:01] PROBLEM - MySQL Idle Transactions on db62 is CRITICAL: Connection refused by host [23:59:10] PROBLEM - MySQL Slave Running on db62 is CRITICAL: Connection refused by host [23:59:37] PROBLEM - MySQL Recent Restart on db62 is CRITICAL: Connection refused by host [23:59:38] PROBLEM - MySQL disk space on db62 is CRITICAL: Connection refused by host [23:59:56] PROBLEM - MySQL Replication Heartbeat on db62 is CRITICAL: Connection refused by host
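
On AaronSchulz's question about where the git-deploy code actually lives: a hedged way to check whether most of it is carried in the puppet tree rather than in operations/debs/git-deploy (the anonymous clone URL format is an assumption):

    git clone https://gerrit.wikimedia.org/r/p/operations/puppet.git
    cd puppet && git grep -il 'git-deploy\|git deploy'
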