[00:00:17] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42205 [00:00:35] binasher: None of the 3 new databases I added are there on db64 [00:00:40] Looks like it's not replicating them.. [00:00:46] New patchset: Asher; "pull db64 from s3" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42207 [00:00:49] db64 is totally broken [00:01:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.926 seconds [00:01:03] strong words [00:01:07] i think it wasn't actually in production until 18:54:48 [00:01:28] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42207 [00:02:17] fail [00:02:44] !log taking down kuo, its power settings are freaking it out, will fix and bring back into servie [00:02:54] Logged the message, RobH [00:02:56] !log asher synchronized wmf-config/db.php 'pulling db64 from s3' [00:03:06] Logged the message, Master [00:03:37] https://gerrit.wikimedia.org/r/gitweb?p=operations/mediawiki-config.git;a=history;f=wmf-config/db.php;h=c35cad88eb45beba1ee4188e28d4a92863e4c76f;hb=HEAD [00:07:09] RECOVERY - MySQL Slave Running on db64 is OK: OK replication [00:07:45] RECOVERY - MySQL Replication Heartbeat on db64 is OK: OK replication delay seconds [00:10:00] PROBLEM - mysqld processes on db64 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [00:10:27] RECOVERY - MySQL Slave Delay on es1002 is OK: OK replication delay NULL seconds [00:10:37] RECOVERY - MySQL Slave Delay on es1004 is OK: OK replication delay NULL seconds [00:11:41] !log rebuilding db64 via hotbackup of db1010 [00:11:49] Logged the message, Master [00:12:06] Reedy: yeah, that's why i was wondering if db.php on disk on fenari was out of sync with git [00:12:57] !log converting es1001 to innodb [00:13:05] Logged the message, notpeter [00:13:55] Mmmm [00:14:03] PROBLEM - mysqld processes on es1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [00:14:06] I'm guessing it must've been [00:14:21] huh, tim killed queries on it on 12/27 according to the server admin log [00:14:30] PROBLEM - Host kuo is DOWN: PING CRITICAL - Packet loss = 100% [00:14:51] and it looks like replication on it broke at Dec 27 08:53 [00:15:03] so maybe it was sorta pulled then [00:15:18] doesn't explain ptwikivoyage though [00:15:51] !log modifying kuo bios power profile from DAPC to OS performance per watt setting in attempt to resolve runaway process issue [00:16:00] Logged the message, RobH [00:19:00] RECOVERY - Host kuo is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [00:19:03] binasher: https://rt.wikimedia.org/Ticket/Display.html?id=4105 [00:19:57] !log starting innobackupex from es1002 to es1001 [00:20:07] Logged the message, notpeter [00:35:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:37:36] !log bsitu Finished syncing Wikimedia installation... 
: Update Echo to Master [00:37:44] Logged the message, Master [00:43:32] New patchset: Pyoungmeister; "swapping db61 and db62 to coredb stuffs for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42208 [00:45:40] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42208 [00:50:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds [01:08:03] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [01:08:39] New patchset: Ryan Lane; "Only define the directories if they aren't already" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42209 [01:09:10] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42209 [01:11:39] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 291 seconds [01:12:03] New patchset: Ryan Lane; "Fix deploy state directory requirement" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42210 [01:16:01] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [01:16:01] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [01:16:01] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [01:16:54] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 23 seconds [01:17:47] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42210 [01:23:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [01:33:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.414 seconds [01:38:58] Change abandoned: Ryan Lane; "Handled by a newer change" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/39201 [01:40:07] New patchset: Ryan Lane; "Sort all hashes used in templates" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42212 [01:41:35] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42212 [01:53:44] New patchset: Andrew Bogott; "Work around a bug with TLS in opendj." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42213 [01:56:25] andrewbogott: one inline comment [01:56:53] I just made the privs match the files as created by the .deb [01:57:01] yeah [01:57:07] All the files in that dir are -rw-r--r-- [01:57:16] yep [01:57:26] but we like to make files owned by puppet read only [01:57:36] Makes sense. [01:58:53] New patchset: Andrew Bogott; "Work around a bug with TLS in opendj." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42213 [02:01:13] New patchset: Ryan Lane; "Fix template location" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42214 [02:01:55] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42214 [02:04:31] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42213 [02:04:37] andrewbogott: ^^ [02:04:43] thanks [02:07:09] yw [02:07:59] Ryan_Lane: There's another bigger, less pressing patch awaiting your review. [02:08:07] oh? [02:08:21] https://gerrit.wikimedia.org/r/#/c/40344/ ? [02:08:22] https://gerrit.wikimedia.org/r/#/c/40344/ [02:08:25] * Ryan_Lane nods [02:08:28] Yeah, from a while ago. 
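
For context on the db64 rebuild and the es1001 InnoDB conversion logged earlier this hour ("rebuilding db64 via hotbackup of db1010", "starting innobackupex from es1002 to es1001"), that kind of hot backup typically boils down to streaming a Percona XtraBackup run from the donor to the host being rebuilt. A minimal sketch; the backup user, password, and datadir paths are assumptions, not values from this log:

    # on the donor (e.g. db1010), stream a hot backup to the host being rebuilt
    innobackupex --user=backup --password=XXXX --stream=tar /tmp \
        | ssh db64 'tar -xif - -C /srv/sqldata.new'
    # on the destination, replay the InnoDB log before pointing mysqld at the new datadir
    innobackupex --apply-log /srv/sqldata.new
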
[02:08:42] It's too big to actually read :( [02:08:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [02:10:18] if we're really lucky I haven't edited them manually on a bunch of systems :D [02:10:25] I doubt that, though [02:10:37] andrewbogott: I +1'd it [02:10:45] when you're ready to merge it in and such, go for it [02:10:51] ok, thanks. [02:11:01] Yeah, I worry that they'll clobber all kinds of local customizations here and there :( [02:12:14] I almost always moved my changes into svn [02:12:19] unless it was a total hack [02:12:32] then I fixed the hack and moved it into svn [02:12:46] labs-nfs1 had some hacks that never got fixed [02:12:49] same with labstore2 [02:12:57] the only other system that uses the scripts is formey [02:13:03] and people rarely use them [02:14:07] So we're probably pretty safe -- those files were duped from labstore2, and labs-nfs1 is mostly obsolete [02:14:27] * Ryan_Lane nods [02:14:30] agreed [02:14:53] the only script we really use on formey is the one to modify groups [02:15:01] so I doubt anything bad will happen [02:15:33] PROBLEM - MySQL Replication Heartbeat on db1010 is CRITICAL: CRIT replication delay 254 seconds [02:15:42] PROBLEM - MySQL Slave Delay on db1010 is CRITICAL: CRIT replication delay 262 seconds [02:19:09] RECOVERY - MySQL Slave Delay on db1010 is OK: OK replication delay 0 seconds [02:19:18] RECOVERY - MySQL Replication Heartbeat on db1010 is OK: OK replication delay 1 seconds [02:20:47] !log truncated ( de | it )wikivoyage.searchindex tables [02:20:59] Logged the message, Master [02:22:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds [02:24:15] RECOVERY - mysqld processes on db64 is OK: PROCS OK: 1 process with command name mysqld [02:26:54] !log LocalisationUpdate completed (1.21wmf6) at Fri Jan 4 02:26:53 UTC 2013 [02:27:04] Logged the message, Master [02:27:46] !log restarted replication on db64 from db34 [02:27:57] Logged the message, Master [02:29:03] PROBLEM - MySQL Replication Heartbeat on db64 is CRITICAL: CRIT replication delay 1038 seconds [02:32:03] PROBLEM - MySQL Slave Delay on db64 is CRITICAL: CRIT replication delay 1050 seconds [02:37:03] Ryan_Lane, java.security isn't getting applied on virt0. What am I missing? [02:38:52] andrewbogott: in which way do you mean? [02:38:57] RECOVERY - MySQL Slave Delay on db64 is OK: OK replication delay 0 seconds [02:39:01] the file isn't? [02:39:06] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [02:39:06] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [02:39:10] applied or installed? [02:39:22] right… /etc/java-6-openjdk/security/java.security [02:39:29] the file there is still the original one, not the one from puppet [02:39:32] ah [02:39:40] Not sure if I mean 'applied' or 'installed' [02:39:40] heh [02:39:43] RECOVERY - MySQL Replication Heartbeat on db64 is OK: OK replication delay 0 seconds [02:39:43] 'installed' I guess [02:39:45] you didn't merge it in ;) [02:39:57] I just did so for you [02:40:05] oh, on sockpuppet? [02:40:19] yep [02:40:23] Weird, I did rebase on a 2nd machine and saw the change. I wonder what... [02:41:42] Oh, of course, my local system is pulling from gerrit not from sockpuppet. 
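
The "restarted replication on db64 from db34" step above is the usual CHANGE MASTER dance; a minimal sketch, where the replication user, password, and binlog coordinates are placeholders rather than values from the log:

    mysql -e "CHANGE MASTER TO MASTER_HOST='db34', MASTER_USER='repl', MASTER_PASSWORD='XXXX',
              MASTER_LOG_FILE='db34-bin.000001', MASTER_LOG_POS=4; START SLAVE;"
    # confirm both slave threads are running and watch the delay drain
    mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'
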
[02:41:54] Sheesh, definitely time to call it a day [02:41:59] :D [02:45:39] New patchset: Asher; "Revert "pull db64 from s3"" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42215 [02:46:03] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42215 [02:46:09] RECOVERY - Puppet freshness on neon is OK: puppet ran at Fri Jan 4 02:46:06 UTC 2013 [02:47:33] !log asher synchronized wmf-config/db.php 'returning db64 to s3' [02:47:42] Logged the message, Master [02:49:15] !log LocalisationUpdate completed (1.21wmf7) at Fri Jan 4 02:49:14 UTC 2013 [02:49:23] Logged the message, Master [02:49:44] Well, I still can't log in but the traffic looks right at least. [02:49:48] More of this tomorrow I guess [02:59:39] Question about the creations of our three new wikis. How do you change the interwiki map of a wiki? e.g. switch from Wikipedia standard interwiki map to the wikisource one? [03:09:39] Dereckson: There's only one interwiki map for all Wikimedia wikis, as far as I know. [03:09:47] It can be read via the API or via Special:Interwiki. [03:11:36] 21:02:56 < Busy_importing> Dereckson: It seems to me that as.ws has the wrong interwiki configuration? (wikipedia) http://as.wikisource.org/wiki/Special:Interwiki [03:13:04] (busy_importing being MF-Warburg) [03:15:51] Dereckson: I don't see an issue. [03:44:57] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 187 seconds [03:45:24] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 204 seconds [03:50:12] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds [03:50:40] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds [04:01:00] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [04:01:01] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [04:01:01] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [04:01:01] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [04:01:01] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [04:01:01] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [04:32:21] RECOVERY - swift-account-reaper on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [04:37:26] PROBLEM - swift-account-reaper on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [04:37:45] PROBLEM - mysqld processes on db62 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld [05:08:20] PROBLEM - Puppet freshness on solr2 is CRITICAL: Puppet has not run in the last 10 hours [05:10:26] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours [05:16:18] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [05:20:20] PROBLEM - Puppet freshness on solr3 is CRITICAL: Puppet has not run in the last 10 hours [05:20:21] PROBLEM - Puppet freshness on solr1003 is CRITICAL: Puppet has not run in the last 10 hours [05:21:23] PROBLEM - Puppet freshness on solr1001 is CRITICAL: Puppet has not run in the last 10 hours [05:25:26] PROBLEM - Full LVS Snapshot on db66 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
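
The depool/repool cycle for db64 above is driven entirely from mediawiki-config: edit wmf-config/db.php, merge in gerrit, then push the file out with sync-file, which is what produces the "!log asher synchronized wmf-config/db.php ..." entries seen here. Roughly, on the deployment host (exact steps are a sketch, not a transcript):

    cd /home/wikipedia/common
    git pull                                  # pick up the merged revert from gerrit
    sync-file wmf-config/db.php 'returning db64 to s3'
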
[05:25:27] PROBLEM - mysqld processes on db66 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:25:27] PROBLEM - MySQL Idle Transactions on db66 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:25:27] PROBLEM - MySQL Slave Running on db66 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [05:27:05] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 8.743 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [05:27:06] RECOVERY - MySQL Idle Transactions on db66 is OK: OK longest blocking idle transaction sleeps for 0 seconds [05:27:06] RECOVERY - MySQL Slave Running on db66 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error: [05:27:06] RECOVERY - mysqld processes on db66 is OK: PROCS OK: 1 process with command name mysqld [05:27:06] RECOVERY - Full LVS Snapshot on db66 is OK: OK no full LVM snapshot volumes [05:51:51] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [05:53:30] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 7.230 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [05:59:04] PROBLEM - Auth DNS on labs-ns0.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [06:13:18] RECOVERY - Auth DNS on labs-ns0.wikimedia.org is OK: DNS OK: 5.047 seconds response time. nagiostest.beta.wmflabs.org returns 208.80.153.219 [07:14:57] New patchset: Dereckson; "(bug 43617) FlaggedRevs configuration for pl.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42217 [07:42:04] what? :o [07:43:56] ah [07:47:56] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours [08:18:41] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [08:20:11] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [08:25:08] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [08:34:53] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours [08:42:41] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.057 second response time [08:53:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [08:55:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.173 seconds [09:20:56] PROBLEM - Puppet freshness on silver is CRITICAL: Puppet has not run in the last 10 hours [09:20:56] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [09:29:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:43:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.043 seconds [10:15:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [10:25:22] New patchset: Aude; "(bug 43630) set autoconfirm count to 10 @ fawiki, per consensus" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42232 [10:28:08] RECOVERY - swift-account-reaper on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [10:29:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.062 seconds [10:33:24] PROBLEM - swift-account-reaper on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args 
^/usr/bin/python /usr/bin/swift-account-reaper [10:58:44] PROBLEM - MySQL Slave Delay on db1005 is CRITICAL: CRIT replication delay 192 seconds [10:59:29] PROBLEM - MySQL Replication Heartbeat on db1005 is CRITICAL: CRIT replication delay 204 seconds [11:02:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [11:08:57] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [11:13:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.360 seconds [11:16:53] PROBLEM - Puppet freshness on ms-be1005 is CRITICAL: Puppet has not run in the last 10 hours [11:16:54] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [11:16:54] PROBLEM - Puppet freshness on ms-be1007 is CRITICAL: Puppet has not run in the last 10 hours [11:47:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:03:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.045 seconds [12:32:30] RECOVERY - MySQL Replication Heartbeat on db1005 is OK: OK replication delay 0 seconds [12:33:14] RECOVERY - MySQL Slave Delay on db1005 is OK: OK replication delay 0 seconds [12:35:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:39:50] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [12:39:50] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [12:45:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.474 seconds [13:20:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [13:35:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.207 seconds [14:01:53] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours [14:01:54] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours [14:01:54] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours [14:01:54] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours [14:01:54] PROBLEM - Puppet freshness on ms-be1010 is CRITICAL: Puppet has not run in the last 10 hours [14:01:54] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours [14:06:59] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:23:02] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.035 seconds [14:23:48] Can someone please run: chown mwdeploy /home/wikipedia/common/wmf-config/ExtensionMessages-1.21wmf7.php [14:25:38] !log demon Started syncing Wikimedia installation... : [14:25:48] Logged the message, Master [14:55:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:07:44] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.045 seconds [15:08:39] !log demon Finished syncing Wikimedia installation... : [15:08:48] Logged the message, Master [15:08:53] <^demon> Yay, about time. 
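
Reedy's chown request above is a one-liner on the deployment host; the follow-up check in this sketch is an assumption for illustration, not something run in the log:

    sudo chown mwdeploy /home/wikipedia/common/wmf-config/ExtensionMessages-1.21wmf7.php
    # optional: list any other wmf-config files not owned by mwdeploy
    find /home/wikipedia/common/wmf-config -maxdepth 1 ! -user mwdeploy -ls
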
[15:09:50] PROBLEM - Puppet freshness on solr2 is CRITICAL: Puppet has not run in the last 10 hours
[15:09:57] :)
[15:11:56] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours
[15:15:40] So, next step is to push wikidatawiki back to wmf7?
[15:17:27] aude: ^
[15:17:32] -tech got busy ;)
[15:18:03] brb, just gotta get a cable
[15:19:26] yea
[15:19:28] yes
[15:21:50] PROBLEM - Puppet freshness on solr1003 is CRITICAL: Puppet has not run in the last 10 hours
[15:21:50] PROBLEM - Puppet freshness on solr3 is CRITICAL: Puppet has not run in the last 10 hours
[15:22:53] PROBLEM - Puppet freshness on solr1001 is CRITICAL: Puppet has not run in the last 10 hours
[15:28:07] ok
[15:28:59] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikidatawiki to 1.21wmf7
[15:29:05] yay!
[15:29:07] Logged the message, Master
[15:29:23] Everything not broken? ;)
[15:29:26] looks good
[15:29:35] * aude tries making an item
[15:29:59] Have the bots been on holiday?
[15:30:07] ignore that
[15:33:10] works for me :)
[15:33:47] Yay
[15:34:34] thanks Reedy and ^demon for helping with this
[15:34:44] Chad did most of it
[15:34:45] :)
[15:36:15] now we have a way to kick the memcached for the sites stuff, so hope we don't see such bugs again
[15:40:00] Whaaaa
[15:40:01] -rw-r--r-- 1 dzahn wikidev 2642 Oct 18 02:32 index.php
[15:41:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:42:33] New patchset: Reedy; "Split dblists into their own subheading" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42269
[15:45:34] New patchset: Reedy; "wikidatawiki to 1.21wmf7" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42270
[15:45:52] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42270
[15:46:37] New patchset: Reedy; "Split dblists into their own subheading" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42269
[15:47:48] New patchset: Reedy; "Split dblists into their own subheading" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42269
[15:53:48] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.047 seconds
[16:03:48] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused
[16:27:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:39:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.040 seconds
[16:57:48] PROBLEM - MySQL Replication Heartbeat on db66 is CRITICAL: CRIT replication delay 303 seconds
[16:58:42] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.003 second response time on port 11000
[17:01:25] RECOVERY - MySQL Replication Heartbeat on db66 is OK: OK replication delay 0 seconds
[17:12:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:28:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds
[17:49:21] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[17:55:10] New patchset: Dzahn; "planet: have one theme config per language" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42277
[17:57:34] reedy@fenari:/home/wikipedia/common/docroot/noc/conf$ ls -al | grep dzahn
[17:57:34] -rw-r--r-- 1 dzahn wikidev 9168 Oct 18 02:32 fc-list
[17:57:34] -rw-r--r-- 1 dzahn wikidev 2642 Oct 18 02:32 index.php
[17:57:42] mutante: ^ could you g+w those for me please?
[17:57:51] Reedy: looks legit. that dzahn guy is alright
[17:58:06] heh
[17:58:55] Reedy: done
[17:58:58] thanks
[17:59:04] no prob!
[17:59:18] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42269
[17:59:23] heh :)
[17:59:41] i am not even sure when that happened
[17:59:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:00:18] Probably when we were moving the docroots about
[18:00:20] https://noc.wikimedia.org/conf/
[18:00:23] ^ dblist is slightly nicer
[18:00:43] Anyone know of a better image (cylinder on a piece of paper?) for the dblists?
[18:02:06] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 182 seconds
[18:02:44] I wonder where those images even came from..
[18:02:50] Krinkle|detached: ^^ Did you add them?
[18:03:36] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 225 seconds
[18:14:07] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.041 seconds
[18:18:11] Reedy: http://commons.wikimedia.org/wiki/Category:Database_icons
[18:18:26] http://commons.wikimedia.org/wiki/File:Database-mysql.svg that?
[18:20:16] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42277
[18:23:29] New patchset: RobH; "adding tarin back into poolcounter pool" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42280
[18:24:30] well... pre-puppet that change would have synced out and crashed poolcounter service...
[18:24:33] yay for jenkins
[18:25:03] New patchset: RobH; "adding tarin back into poolcounter pool" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42280
[18:25:03] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[18:25:12] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds
[18:28:30] New review: RobH; "this isnt the self review you are looking for" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/42280
[18:28:44] New review: RobH; "this isnt the self review you are looking for" [operations/mediawiki-config] (master); V: 2 C: 2; - https://gerrit.wikimedia.org/r/42280
[18:28:44] Change merged: RobH; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/42280
[18:29:21] !log robh synchronized wmf-config/PoolCounterSettings.php
[18:29:31] Logged the message, Master
[18:29:55] re-pool successful \o/
[18:33:38] New review: Dereckson; "shellpolicy" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/33390
[18:34:40] sbernardin1: https://rt.wikimedia.org/Ticket/Display.html?id=4032 the ticket for the intel SFP
[18:34:48] so if they are in, sign, scan, attach packign slip to ticket and resolve
[18:35:17] RobH: doing that now...
[18:35:32] cool, i hadnt assigned ticket for you
[18:35:35] so just lettin ya know
[18:36:16] cmjohnson1: just assigned the eqiad sfp ticket to you for same =]
[18:36:18] PROBLEM - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours
[18:36:32] k
[18:37:12] <-- stat1: Failed to parse template redis/redis.conf.erb: Could not find value for 'redis_replication'
[18:41:45] cmjohnson1: Ok, the labsdb saga
[18:41:52] I'll recap what we have so we're on same page.
[18:41:56] k [18:42:00] (I was able to poke asher and ryan about it since they sit by me) [18:42:14] in Tampa we have labsdb1-3, ciscos, with the sandisk ssds installed [18:42:24] ACKNOWLEDGEMENT - Puppet freshness on stat1 is CRITICAL: Puppet has not run in the last 10 hours daniel_zahn known issue with redis template [18:42:28] in Ashburn we have labsdb1001-1002, ciscos, with sandisk ssds installed [18:42:44] we are goign to take another virt100# server for labsdb1003 [18:42:58] however, i doubt you have the spare sandisks for this, right? [18:43:21] right..i have 2 spare [18:43:34] can you gimme the model info? i'll make a ticket to order more [18:43:39] and we need 8 per cisco iirc [18:43:52] New review: Faidon; "Thanks for this!" [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/41976 [18:44:08] i will get it to you ...do you want a ticket...they're in the flex space [18:45:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:45:58] cmjohnson1: https://rt.wikimedia.org/Ticket/Display.html?id=4256 [18:46:02] just reply to that ticket with it [18:46:24] we'll be taking virt1008 and renaming it [18:46:30] so dropping tickets for those changes now as well [18:46:35] RobH, cmjohnson1: what would it take to replace SSDs on ms-be1005-1012 to 320s? [18:46:43] with 320s even [18:47:00] uhh, because they are 710s now? [18:47:52] no [18:47:54] robh: i don't think they are [18:48:01] 1001-1002 are temporarily 710s [18:48:06] the rest are X25-Ms [18:48:11] right^ [18:48:12] we dont have x25s in eqiad. [18:48:29] the equivalent you bought for here [18:48:34] indeed, 320s. [18:48:40] so those servers should already have 320s in them. [18:50:21] Model Family: Intel X18-M/X25-M/X25-V G2 SSDs [18:50:21] Device Model: INTEL SSDSA2M160G2GN [18:50:34] oddddd [18:50:36] how?!?! [18:50:39] we never ordered x25s. [18:51:04] cmjohnson1 will need to pop one and compare it to the 320s on site [18:51:08] cuz i think they are 320s... [18:51:10] they were the ssd's taken out of the c2100's [18:51:14] the x25 had a black ring [18:51:19] plastic ring on top [18:51:23] paravoid: okay to remove a disk from ms-be1010? [18:51:25] where the 320s were bare metal... [18:51:35] everything but 1001-1004 are okay to mess with [18:51:38] im interested to see what pulls out of there [18:51:39] ok [18:51:45] lemme know =] [18:52:04] thanks :) [18:52:36] robh, paravoid ...intel ssd 320 Series [18:52:42] huh [18:52:42] only tampa has the x.25 [18:52:52] that output was from ms-be1003 [18:53:00] could they share device models? [18:53:28] model ssdsa2bw160g3 [18:53:53] that's not what ms-be1003 had above [18:54:06] all the same ssds here [18:56:03] weird [18:58:32] cmjohnson1: could you do me a favor and check ms-be1005 too? [18:58:46] ms-be1010 is not provisioned, so I can't run hdparm [18:59:08] ms-be1005 is installed but safe to shut down [18:59:29] okay...do you want to shut it down first? [18:59:36] done [18:59:45] k [19:00:03] paravoid: they are [19:00:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.041 seconds [19:00:15] paravoid: my understanding is the 320 and the x25 are pretty much identical in operation [19:00:20] so them having the same chipset is not surprising [19:00:29] the smartmontools database has different model numbers for them [19:00:32] mutante: ahoy. ottomata mentioned you might be able to verify that /a/eventlogging on stat1 gets picked up by the tape backups -- can you? 
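
The SSD identification going on in this exchange (320 series vs. X25-M) relies on what smartctl and hdparm report for each drive; a minimal sketch, assuming the disk shows up as /dev/sda:

    sudo smartctl -i /dev/sda | grep -E 'Model Family|Device Model|Firmware Version'
    sudo hdparm -I /dev/sda | grep -i 'Model Number'
    # per the log, the 320 series reports models like SSDSA2BW160G3 ("G3"),
    # while the X25-M G2 reports SSDSA2M160G2GN
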
[19:00:41] awjr: you have the link? [19:00:49] AaronSchulz: yah, thanks :) [19:00:52] New patchset: Pyoungmeister; "coredb: es shards are now 100% inno." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42283 [19:01:13] ori-l: ok, lemme check on tridge [19:01:17] notpeter: \o/ [19:01:23] RobH: and in there it says "G3" for the 320, the label that cmjohnson1 also said G3, but hdparm/smartctl say G2 in my tests [19:01:27] mutante: thanks [19:01:31] paravoid/robh: so it's a different disk [19:01:43] mutante: it's a newish subdir, only been there for a week or two [19:01:43] oh? [19:01:54] are those X25-Ms? [19:01:55] Change abandoned: Nemo bis; "Dereckson, your comments don't make any sense." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33390 [19:02:04] AaronSchulz: that actually happened yesterday afternoon [19:02:05] hurray! [19:02:11] <^demon> paravoid: So yeah, I wasn't sure if ensure => absent would work on nfs anyway :) [19:02:13] in fact i checked i disks in 1011 and 1009 and the other disk in 1010 and they are all ssdsa2m1602gn [19:02:34] AaronSchulz: looks to me like purgeParserCache.php needs to be updated to support sharded bagOStuffz.. do you concur? [19:02:35] okay [19:02:43] so we do have X25-Ms in eqiad :) [19:02:45] the one disk i pulled on 1010 was the only 320 [19:02:51] bleh [19:03:09] PROBLEM - Host ms-be1005 is DOWN: PING CRITICAL - Packet loss = 100% [19:03:32] cmjohnson1: So we have a mix of X25 and 320s? (this makes little sense, bleeeeh) [19:03:34] ori-l: no, that issue is still not solved.. it just backs up home [19:04:00] yeah...now I need to check 1005-1012 to see [19:04:20] ori-l: it's an existing ticket RT-3098 that ottomata has [19:04:28] reason is /a being too large for a tape volume [19:04:37] cmjohnson1: now that we verified we can trust hdparm's output, I can check 1005-1007 for you [19:04:38] let me chime in quickly [19:04:43] (they're installed) [19:04:44] mutante and ori-l [19:04:46] k [19:04:51] woosters: sooooooooo i dont htink i assigned parsoid servers in ashburn after all [19:04:51] mutante, I think we excluded a lot of /a directories and got things working [19:04:54] reason that /a is too large is that wikistats has a 32gb stuff [19:04:58] not sure if I remember correctly though [19:04:58] but i have some spares, so i'll get them spun up today. [19:05:17] ottomata: but unfortunately i don't see stuff on tridge in /data/amanda [19:05:21] robh - thks [19:05:26] ottomata: just these 00001.stat1.wikimedia.org._home.0 [19:05:34] robh: isn't wtp1001 parsoid? or in addition to that [19:05:38] i would expect something like 00001.stat1.wikimedia.org._a.0 or something [19:05:48] so no _a [19:05:48] ah [19:06:11] cmjohnson1: in addition [19:06:23] so basically the servers we have allocated and will allocate today are single cpu misc servers [19:06:26] cmjohnson1: 1005 was X25, 1006 too [19:06:41] so they will be replaced for parsoid use once the new dual cpu misc servers arrive (which quotes i am working on today) [19:06:43] can't ssh to 1007 [19:06:46] paravoid: all but the 1 disk on 1010 [19:06:49] wow [19:06:52] that's some luck [19:07:15] the fact that it was the first disk you pulled at random [19:08:15] ottomata: do you get that Amanda email ? [19:08:16] crazy eh? [19:08:16] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42283 [19:08:30] heh yeah [19:08:31] ottomata: because i cant login right now.. 
but that should give a summary [19:08:56] robh: so will the allocated server for parsoid be temporary since they are single cpu? [19:08:58] so I think we agreed with binasher that we should use 320s for those boxes [19:09:19] paravoid: i need to see how many 320 [19:09:19] in the list, like two or three days ago [19:09:22] s i have here [19:09:23] ..and if it still has the "dump larger than" ...issue.. then it's still that [19:09:33] RobH: unleash the intel 320 orders [19:09:45] binasher: right? [19:09:55] ah yeah, mutante, i get them [19:10:01] FAILED "[dump larger than available tape space, 518277807 KB, incremental dump also larger than tape]" [19:10:14] * cmjohnson1 checking storage for # of 320's [19:10:23] should we split up /a/? [19:10:50] * paravoid wonders what we're going to do with a bunch of known-to-lose-data SSDs :) [19:10:51] if that is not a big issue for you that sounds like a solution. [19:10:54] ori-l: [19:11:27] ottomata: is there another path on stat1 i could rsync the data to that _would_ get picked up by the backups? [19:11:47] ori-l: a quote from andrew "Diederik and Erik, does all of /a actually need to be backed up? I know that a lot of the stuff in there is just duplicate log data copied over from the udp2log servers" [19:12:06] mutante: i only care about /a/eventlogging [19:12:24] New patchset: Silke Meyer; "Changes order of settings for role_config to work" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42285 [19:12:43] but there is other stuff in /a/ that is important, just not anything i oversee [19:12:44] ori-l: then let's configure that separately in backups ... [19:12:55] mutante: ok! that would be awesome, thanks! [19:13:03] ori-l: the alternative would be configuring a larger virtual tape size.. but none of us did that before http://wiki.zmanda.com/index.php/How_To:Set_Up_Virtual_Tapes [19:13:11] well, mutante, we're already excluding /a/squid/archive [19:13:19] which is all the log data [19:13:30] binasher: I have 24 SSDs for varnish boxes for you! [19:13:32] :P [19:13:43] we are also excluding wikistats/backup and wikistats/tmp [19:13:58] uhh, dammit.lt is 419G? [19:14:03] 119M eventlogging/ [19:14:05] <-- small [19:14:08] paravoid: they're perfect for that, but too small these days [19:14:09] erosen is 14G [19:14:19] fundraising 2.6G [19:14:20] ottomata: heh, that would be domas [19:14:32] yeah, eventlogging is small, i don't want to bundle it with the rest of the stuff if it means that it fails to back up periodically [19:14:33] wikistats git, 255G [19:15:04] hmmmmmmmmmmmmmm [19:15:08] paravoid robh i have enough 320's for 6 ms-be (includes the lone 320 in ms-be1010) [19:15:09] mutante, how does amanda work with symlinks? [19:15:11] here's the change https://gerrit.wikimedia.org/r/#/c/20946/2 [19:15:18] that reduced it last time [19:16:36] ottomata: i think it will backup the link itself but not the data it points to.. but also just googled that [19:17:11] aye [19:17:13] cmjohnson1: can we fit them to 1005-1007? [19:17:20] there are 1.9T in /a [19:17:40] paravoid: yep [19:17:46] great [19:17:49] thanks :) [19:17:49] * cmjohnson1 creates a ticket  [19:17:56] cmjohnson1: don't provision [19:17:58] or if you do [19:18:07] create a raid0 between those two ssds [19:18:17] in the controller [19:18:26] okay [19:18:47] oh hm [19:18:51] wait [19:19:00] ... [19:19:00] I have to give asher the 710s back at some point [19:19:13] yeah...but robh will need to order more 320s [19:19:29] binasher: that okay with you? 
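
The per-directory sizes being quoted here (dammit.lt 419G, wikistats 255G, eventlogging 119M, about 1.9T in /a overall) are the usual du survey used to decide what can fit on a tape volume; a minimal sketch:

    du -sh /a/* 2>/dev/null | sort -rh   # largest subdirectories of /a first
    du -sh /a                            # overall total (roughly 1.9T at the time)
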
[19:19:37] i have a bunch for asher if he needs them sooner [19:19:39] waiting for the next batch of 320s to give you back the 4 710s? [19:19:45] aha, okay [19:20:06] New patchset: Dzahn; "backup /a/eventlogging on stat1 separately because /a fails being too large" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42286 [19:20:11] okay then :) [19:20:12] ottomata: ori-l ..there [19:20:17] thanks! [19:20:17] mutante: thank you! [19:20:26] yw [19:20:43] that's fine, i still to either get ciscos back from analytics before using the remaining 710's in dbs [19:20:51] or order new servers.. but i think i'm getting ciscos [19:21:42] so if i need to order 320s [19:21:46] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42286 [19:21:47] someone has to tell me how many and why =] [19:22:14] New review: Ori.livneh; "Thanks again, much appreciated!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42286 [19:22:18] cmjohnson1: you want the labsdb1003 install ticket assigned to you? [19:22:21] PROBLEM - Puppet freshness on silver is CRITICAL: Puppet has not run in the last 10 hours [19:22:22] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours [19:22:25] i alrady made all the other related tickets yours ;] [19:22:26] yes please robh [19:22:39] all yours [19:22:47] ottomata: so.. let's check that tomorrow. i did not touch the existing /a/ backup, just added it.. and /a/ will still fail.. but at least eventlogging should be safe [19:22:54] well we need 320's for at least 6 more ms-be servers so that's 12 plus spare [19:23:00] that's cool [19:23:04] RobH: however many ceph needs + 6 for the three misc servers you were allocating for me yesterday + let's say 20 [19:23:12] ori-l, also, reminder, we have eventlogging in hdfs [19:23:17] so its replicated 3 times across 10 nodes [19:23:19] in hadoop [19:23:45] ceph is independent of msbe or same servers? [19:23:50] same [19:23:59] ok, so yea 12 and spares [19:24:01] paravoid: can confirm [19:24:04] 20 seems reasonable, i'll make a ticket. [19:24:04] yes [19:24:13] ottomata: okay, but that whole setup is full of new, and it needs to be in use for a while longer before we can trust it as a backup [19:24:23] we might want them for ms-be @ pmtpa at some point [19:24:25] btw.. backups listed here http://wikitech.wikimedia.org/view/Backup_procedures#Misc_data [19:24:35] but not yet [19:24:50] claims its just 400G..but i guess its way more .. even with the excludes [19:25:00] mutante: would it be easy to add me to the automated emails you mentioned earlier? [19:25:22] ori-l: hmm.. i dont know yet ..hold on [19:25:36] New patchset: Pyoungmeister; "fix scoping issue for solr role class" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42287 [19:26:37] update: goign to go with 10 160GB for msbe [19:26:50] and 10 300GB for the other db type servers (i have three i'll be dropping tickets for) [19:27:06] robh: i need 12 for msbe [19:27:22] oh, 12 with or without spares? [19:27:23] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42287 [19:27:23] without [19:27:23] ok, then 16. [19:27:48] New review: Nemo bis; "Also, please don't assign me any shell bug again because it's my opinion that the fixer is the shell..." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/33390 [19:28:00] paravoid: let's get rid of the c2100's first before we go messing w/tampa [19:28:12] ;-] [19:28:49] PROBLEM - Packetloss_Average on locke is CRITICAL: XML parse error [19:29:14] ori-l: hmm.. not really that easy.. it just mails root@ and even if i added your address as well you would get mail for all other backups, not just this one [19:29:54] mutante: oh, ok. is it alright then if i follow-up with you in a few days' time to make sure the backups are indeed getting picked up? [19:29:58] it's a global config in amanda [19:30:00] RECOVERY - Puppet freshness on solr1001 is OK: puppet ran at Fri Jan 4 19:29:40 UTC 2013 [19:30:05] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42285 [19:30:10] ori-l: yes please [19:30:34] mutante: thanks again [19:30:39] np [19:31:34] ori-l, re hdfs eventlogging, totally agree :) [19:31:39] RECOVERY - Packetloss_Average on locke is OK: OK: packet_loss_average is 3.45676918519 [19:32:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:51] MaxSem: you there? [19:37:04] notpeter, yes sir [19:37:05] paravoid: robh: as bust most of the 320's open they are all 300GB not 160GB [19:37:32] MaxSem: ok, puppet is no longer borken on the solr boxes [19:37:43] (there was a scoping issue that I didn't catch when I looke dover your stuff) [19:37:47] want to deploy repl on monday? [19:38:05] notpeter, sure [19:38:08] cool! [19:38:25] should we get a deploy slot? or just do it whenevs? [19:38:51] because it involves MW config syncs, I'd rather reserve a window for it [19:39:26] cool! want to schedule? [19:39:26] cmjohnson1: wait, we have 320s? [19:39:44] cmjohnson1: I thought we had x25s there? [19:39:50] * RobH is confused as shit [19:40:14] yes...we have a few...enough for 5 maybe 6 servers. [19:40:18] we need 12 more [19:40:25] notpeter, after 1pm PST or before 11am? [19:40:41] hhhmmm, either is fine for me. what works better for you? [19:40:42] I don't care until 3pm [19:41:22] let's to morning, then [19:41:26] (PST) [19:41:26] Yossie: welcome [19:43:06] https://wikitech.wikimedia.org/index.php?title=Deployments&diff=55061&oldid=55060 reserved 2 hours just in case [19:44:03] FOR SCIENCE! http://www.livescience.com/25959-atoms-colder-than-absolute-zero.html [19:44:38] negative temperatures? wth [19:44:54] yossie - welcome [19:45:00] "new engines that could technically be more than 100 percent efficient" [19:45:08] wow.. perpettum mobile for realz?? [19:46:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds [19:46:34] Greg Bear wrote a book called Heads in which they use negative (<0K) temperatures.. It was curious, at the least, and featured in the much better book "Moving Mars" by same author. [19:50:28] MaxSem cool, sounds good [19:56:24] RECOVERY - Puppet freshness on solr1003 is OK: puppet ran at Fri Jan 4 19:56:02 UTC 2013 [19:57:00] RECOVERY - Puppet freshness on solr2 is OK: puppet ran at Fri Jan 4 19:56:33 UTC 2013 [19:57:01] RECOVERY - Puppet freshness on solr3 is OK: puppet ran at Fri Jan 4 19:56:37 UTC 2013 [19:57:10] RECOVERY - Puppet freshness on vanadium is OK: puppet ran at Fri Jan 4 19:56:56 UTC 2013 [19:58:38] Ryan_Lane: btw, ldap logins for my RT install are working now. 
Lots of questions about how to actually migrate to that though… probably I'll need to sit down with you and mutante next week and discuss. [19:59:08] heh [19:59:09] yeah [19:59:19] may need to rename user accounts in the database first [20:00:06] Yeah, I have no idea what happens when account names overlap. I see other folks online allowing that in their install but it's not clear what the behavior is. [20:00:20] andrewbogott: oh cool where is your install? [20:00:29] there are auto-created and "real" users though [20:01:13] auto-created have an email address format the others have nicknames [20:01:59] !log depooling ssl1001 [20:02:09] Logged the message, Master [20:06:55] mutante: abogott-request-tracker.pmtpa.wmflabs [20:07:22] andrewbogott: cool, is it 4.0.4 ? [20:07:35] we also have an upgrade ticket :) [20:07:47] no, 3.8.11 because I was imagining a gentle switch from the existing install. [20:08:00] I can only hope that the process for 4.0.4 is similar :/ [20:09:34] oh yea, of course it makes sense to test with a similar version.. but we can try the upgrade on labs later..that's also nice to have [20:09:35] New patchset: Pyoungmeister; "adding cmcmahon to admin.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42291 [20:10:19] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42291 [20:10:20] mutante: Getting rt up in labs from puppet was not especially difficult. The puppet stuff is a bit broken in that it installs apache and lighttp both and that caused me a great deal of confusion at first. [20:11:44] andrewbogott: i see.. seems we have too many ways to install Apache [20:12:12] It's just because of tangled dependencies in puppet… everything trickles down to Apache eventually [20:16:18] andrewbogott: i still use this class {'webserver::php5': ssl => 'true'; } btw [20:16:28] be back after lunch [20:17:11] mutante: Right, that installs Apache, but rt actually runs on lighttpd [20:18:36] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:28:23] New patchset: Asher; "adding no_master array to the db topology, to map to the mysql-master-ha "no_master=1" node config option. snapshot hosts plus hosts defined here will be excluded from consideration for master promotion." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42292 [20:29:09] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42292 [20:29:34] New patchset: Pyoungmeister; "coredb: include slow_query_digest in a not ridiculous way" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42293 [20:30:55] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42293 [20:31:29] mutante: I was meaning something to match https://noc.wikimedia.org/conf/images/document.png [20:32:36] we do get some interesting files in commons [20:32:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.050 seconds [20:34:14] New review: Demon; "Is this still necessary? If not, please abandon." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/2141 [20:40:38] ^ Looks redundant to Tims work on fluorine... [20:43:39] <^demon> Reedy: That's what I thought, and at the very least would need to be rewritten. [20:43:45] <^demon> Figured I'd ask Roan though. 
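
On andrewbogott's confusion about the RT puppetization installing both apache and lighttpd: a quick, hedged way to see which daemon actually owns port 80 on such a box is to check the listener directly:

    sudo netstat -tlnp | grep ':80 '
    # or, equivalently:
    sudo lsof -iTCP:80 -sTCP:LISTEN
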
[20:52:40] !log db1038 removing bad hdd from slot6 for replacement [20:52:49] Logged the message, Master [21:00:24] log db1042 removing/replacing bad hdd from slot 9 [21:01:30] !log db1042 removing/replacing bad hdd from slot 9 [21:01:40] Logged the message, Master [21:04:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:10:22] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours [21:14:34] PROBLEM - Host ms-be1007 is DOWN: PING CRITICAL - Packet loss = 100% [21:18:18] PROBLEM - Puppet freshness on ms-be1006 is CRITICAL: Puppet has not run in the last 10 hours [21:19:03] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.241 seconds [21:22:51] Ryan_Lane: where is all the git-deploy stuff? [21:23:16] http://wikitech.wikimedia.org/view/Git-deploy [21:23:20] !log restarting torrus-common on manutius [21:23:29] Logged the message, Master [21:23:34] * AaronSchulz looks at operations/debs/git-deploy [21:23:50] !log authdns-update for some mgmt name changes [21:23:59] Logged the message, RobH [21:25:02] !log mauntius/torrus: http://wikitech.wikimedia.org/view/Torrus#Deadlock_problem [21:25:12] Logged the message, Master [21:26:07] AaronSchulz: looks to me like purgeParserCache.php needs to be updated to support sharded bagOStuffz.. do you concur? [21:26:15] why? [21:26:50] it just calls deleteObjectsExpiringBefore() [21:27:07] that function was updated wasn't it? [21:28:20] New patchset: Dzahn; "add parser cache purging script as cronjob on hume, per RT-2108" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38275 [21:28:20] ah, yeah, amending one more patch set to let that cron job run once a way (weekday 0) [21:29:02] s/way/week [21:29:45] AaronSchulz: ah yes, it iterates thru the shards sequentially.. nevermind! [21:30:54] RobH: torrus fixed:) [21:32:34] RECOVERY - Host ms-be1007 is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms [21:36:00] PROBLEM - Host ms-be1006 is DOWN: PING CRITICAL - Packet loss = 100% [21:36:28] RECOVERY - SSH on ms-be1007 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:39:10] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/38275 [21:41:52] RECOVERY - Host ms-be1006 is UP: PING OK - Packet loss = 0%, RTA = 26.66 ms [21:42:55] WTF ?!?! [21:42:58] neon started working on its own [21:42:59] puppet [21:43:09] LeslieCarr: :)) [21:43:22] did you do something to fix it ? [21:43:28] or is this seriously miraculous fixing [21:43:33] no, i guess it was just puppet having issues [21:43:38] yesterday [21:43:47] my brain as-plode [21:44:10] bad timing when puppetmaster was restarted and now something refreshed? shrug [21:44:51] LeslieCarr: i did nahsing [21:45:52] weird [21:45:54] so weird [21:46:21] PROBLEM - SSH on ms-be1006 is CRITICAL: Connection refused [21:47:24] and.. puppet in general is not that slow anymore [21:47:31] like on hume.. just 30 seconds [21:48:08] <^demon> It's not bad on the gerrit boxes either. [21:49:48] RECOVERY - SSH on ms-be1006 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [21:50:34] New patchset: Lcarr; "allowing spence to talk to neon nrpe + deleting unused template" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42350 [21:50:38] well this could be a cpu valley [21:50:39] ? [21:50:43] checking ganglia ... 
[21:50:52] RECOVERY - Host ms-be1005 is UP: PING WARNING - Packet loss = 44%, RTA = 26.86 ms [21:51:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:51:21] LeslieCarr: valley just ended [21:51:29] yeah [21:51:29] as nagios-wm just reported [21:51:31] hehe [21:54:45] PROBLEM - SSH on ms-be1005 is CRITICAL: Connection refused [21:56:50] Ryan_Lane: http://www.gossamer-threads.com/lists/wiki/wikitech/311155?do=post_view_threaded#311155 [21:58:39] New patchset: Nemo bis; "(bug 29692) Per-wiki namespace aliases shouldn't override (remove) global ones" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/25737 [22:00:00] RECOVERY - SSH on ms-be1005 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [22:01:39] PROBLEM - Host ms-be1006 is DOWN: PING CRITICAL - Packet loss = 100% [22:06:09] New patchset: Pyoungmeister; "coredb: addressing researchdb case" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42352 [22:07:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.045 seconds [22:07:24] binasher: set of eyes on that ^^ [22:07:36] sure [22:09:47] RECOVERY - Host ms-be1006 is UP: PING OK - Packet loss = 0%, RTA = 26.77 ms [22:11:06] I'm just not that familiar with the ins and outs of the reasearch dbs. I think that that's all that's special about them.... [22:11:17] but you might know things that I don't about this [22:13:14] RECOVERY - Puppet freshness on ms-be1007 is OK: puppet ran at Fri Jan 4 22:13:04 UTC 2013 [22:14:03] notpeter: all of the researchdbs are currently in s1, but that might change at some point this year [22:15:49] Ryan_Lane: "#176: Change MySQL on tesla vms to use the wmf version" :) [22:16:00] "We shouldn't bother doing this on tesla, but should instead do this on [22:16:03] the virt cluster as part of the default project." [22:17:48] !log updated asw-a-eqiad vlan for 4/0/44 [22:17:57] Logged the message, RobH [22:20:58] binasher: yeah, i thought that might happen at some point, but I'd rather not pre-optimize. legit? [22:22:50] can I elect that future-notpeter will write that code? [22:22:55] notpeter: i also don't really like having a class for them, that just calls another class [22:23:04] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42350 [22:23:36] so you'd rather have, like, a researchdb flag, or something? [22:24:02] and an array, like master/snapshots? [22:24:07] meh, hate all of this [22:24:11] :( [22:24:17] hmm, keep the class [22:24:20] RECOVERY - Puppet freshness on ms-be1006 is OK: puppet ran at Fri Jan 4 22:24:14 UTC 2013 [22:26:04] yay edgecases! [22:27:43] New patchset: Pyoungmeister; "coredb: addressing researchdb case" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42352 [22:29:09] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42352 [22:29:44] RECOVERY - Puppet freshness on ms-be1005 is OK: puppet ran at Fri Jan 4 22:29:16 UTC 2013 [22:33:20] RECOVERY - NTP on ms-be1007 is OK: NTP OK: Offset -0.01997196674 secs [22:33:30] RECOVERY - mysqld processes on es1001 is OK: PROCS OK: 1 process with command name mysqld [22:33:47] closing "#352: detect outdated phase3 images in prod" :) [22:35:31] !log es1001 back in prodcution. all external store are innodb! 
[22:35:39] Logged the message, notpeter [22:36:58] notpeter: i'm running stop slave ; reset master ; reset slave; on es100[2-4] [22:37:01] cmjohnson1: oh wow, thanks! [22:37:24] binasher: ok [22:38:11] paravoid: looks correct to me ...plz chk and let me know [22:38:24] Reedy: do we have sitemaps on commons? [22:39:04] i mean.. a cron that runs a sitemap generation script [22:39:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:39:25] but it also said in 2011 "In the near future we will want sitemaps to be generated in a more clever way, especially since Commons content is exploding, ..." ...shrug [22:40:15] I think we did... [22:40:30] mutante: maintenance/generateSitemap.php [22:41:09] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours [22:41:09] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours [22:41:43] Ryan_Lane: http://wikitech.wikimedia.org/view/Git-deploy#Timeline_for_slots [22:41:51] so can we just do it the second way then? [22:42:00] you can do it however you'd like :) [22:42:13] those are just ideas [22:42:57] so I can delete the first one? :) [22:43:05] yep [22:43:30] whatever you guys think is sanest for managing slots [22:44:17] RECOVERY - NTP on ms-be1006 is OK: NTP OK: Offset -0.01913559437 secs [22:46:08] !log temporarily setting up mysql replication on shard es1 again (es([1-4]|100[1-4]) for mha testing [22:46:17] Logged the message, Master [22:48:30] Reedy: thanks, just looking at ancient tickets to get rid of:) [22:48:48] binasher: fwiw, I'd love to learn more about MHA when you're done :-) [22:49:24] RECOVERY - NTP on ms-be1005 is OK: NTP OK: Offset -0.02033925056 secs [22:49:35] paravoid: if/when mha is done, everyone will have to learn about it. but enthusiasm is good! [22:51:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.050 seconds [22:56:27] binasher: like how all of us know how to do a master switchover now? :P [22:56:57] but optimism is good too :-) [22:59:00] paravoid: https://wikitech.wikimedia.org/view/Master_switch [22:59:15] "This Connection is Untrusted" [22:59:24] yeah I've seen it [22:59:31] stop trying to hack me! [22:59:36] ;) [23:00:27] paravoid: so if you don't know how to do anything outlined in it, ask notpeter [23:00:37] * binasher ducks  [23:00:41] binasher: honestly, I think I'd just call you and wake you up before attempting to do that [23:01:00] nono, wake up notpeter [23:01:05] hahaha [23:02:07] but jokes aside, if you're feeling up to it I'd love to hear more about MHA, not just point-by-point instructions on how to do a switchover [23:02:16] what do you like, what you don't like etc. [23:02:47] might be too much to ask though :) [23:09:44] paravoid: i'm not certain we'll actually use mha, or that if we do, it will be for the long term [23:16:55] paravoid: longer term, i currently like the idea of all db's within a shard in a single datacenter using synchronous multi-master replication via galera, with async standard mysql replication across colos to another multi-master galera cluster. mysql failover would only exist within the context of datacenter failover [23:24:19] Ryan_Lane: is most of the git deploy code in puppet? 
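
binasher's "stop slave ; reset master ; reset slave" on es100[2-4] and the temporary es1 replication for MHA testing look roughly like the following; the MHA config path is an assumption:

    for h in es1002 es1003 es1004; do
        mysql -h "$h" -e "STOP SLAVE; RESET MASTER; RESET SLAVE;"
    done
    # once replication is re-established, MHA's own sanity check:
    masterha_check_repl --conf=/etc/mha/es1.cnf
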
[23:25:41] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:26:28] not seeing much in the repo [23:29:02] !log authdns-update for xenon and caesium dns [23:29:12] Logged the message, RobH [23:29:29] AaronSchulz: ryan went away from computer for awhile, ran to shop down the block [23:29:42] (didnt want you to sit there waiting and wondering if he was going to answer you ;) [23:29:44] "shop"? you mean booze? [23:29:52] yes. [23:29:54] heh [23:30:13] friday \o/ [23:32:02] It's friday, friday, friday, gotta get down on friday [23:33:47] PROBLEM - Auth DNS on ns1.wikimedia.org is CRITICAL: CRITICAL - Plugin timed out while executing system call [23:36:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.284 seconds [23:39:02] RECOVERY - Auth DNS on ns1.wikimedia.org is OK: DNS OK: 0.052 seconds response time. www.wikipedia.org returns 208.80.154.225 [23:39:50] New patchset: Pyoungmeister; "testing: set db62 as researchdb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42356 [23:43:37] New patchset: Pyoungmeister; "testing: set db62 as researchdb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42356 [23:44:52] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/42356 [23:48:58] PROBLEM - Host db62 is DOWN: PING CRITICAL - Packet loss = 100% [23:54:49] RECOVERY - Host db62 is UP: PING OK - Packet loss = 0%, RTA = 1.50 ms [23:59:01] PROBLEM - MySQL Idle Transactions on db62 is CRITICAL: Connection refused by host [23:59:10] PROBLEM - MySQL Slave Running on db62 is CRITICAL: Connection refused by host [23:59:37] PROBLEM - MySQL Recent Restart on db62 is CRITICAL: Connection refused by host [23:59:38] PROBLEM - MySQL disk space on db62 is CRITICAL: Connection refused by host [23:59:56] PROBLEM - MySQL Replication Heartbeat on db62 is CRITICAL: Connection refused by host
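
On AaronSchulz's question about where the git-deploy code actually lives: a hedged way to check whether most of it is carried in the puppet tree rather than in operations/debs/git-deploy (the anonymous clone URL format is an assumption):

    git clone https://gerrit.wikimedia.org/r/p/operations/puppet.git
    cd puppet && git grep -il 'git-deploy\|git deploy'
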