[00:01:18] !log just completed an online schema change for commonswiki.recentchanges in prod. woo! [00:01:21] Logged the message, Master [00:04:54] binasher: ah. sweet. [00:05:01] that's with the new support? [00:06:14] Ryan_Lane: with a slightly hacked up version of pt-online-schema-changed (fenari:/home/asher/db/pt-online-schema-change-2.1.1-no_child_table_patch) [00:06:24] i need to send them my patches [00:07:09] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [00:09:25] New patchset: Asher; "remove old s2 dbs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7695 [00:09:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7695 [00:09:59] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7695 [00:10:01] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7695 [00:10:58] New patchset: Asher; "fix path to tcpdump" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7696 [00:11:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7696 [00:11:18] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7696 [00:11:20] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7696 [00:14:21] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [00:17:01] kk [00:18:24] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [00:34:04] New patchset: Hashar; "basic header for CommonSettings.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7697 [00:43:28] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7180 [00:43:30] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7180 [00:44:06] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7516 [00:44:08] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7516 [00:44:42] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7576 [00:44:44] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7576 [00:45:27] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7697 [00:45:29] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7697 [00:46:02] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7578 [00:46:04] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7578 [00:46:32] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7182 [00:46:34] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7182 [00:48:16] !log rebooting db51 for kernel upgrade, prior to promoting to s4 master 
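(The schema change logged above was run with a locally patched pt-online-schema-change 2.1.1; the exact command line is not in the log. For orientation, this is roughly the shape of such an invocation with Percona Toolkit 2.1-era options, wrapped in Python only so it matches the other sketches in this log. The host, ALTER clause and chunk size are placeholders, not the values used for commonswiki.recentchanges. How the tool actually performs the change, shadow table plus triggers plus chunked copy, is described by binasher further down in this log.)

```python
import subprocess

# Hypothetical invocation shape for pt-online-schema-change (Percona Toolkit 2.1).
# Host, ALTER clause and chunk size below are placeholders.
cmd = [
    "pt-online-schema-change",
    "--alter", "ADD INDEX rc_example (rc_timestamp)",   # placeholder ALTER
    "--chunk-size", "1000",                             # copy rows in small batches
    "--dry-run",                                        # switch to --execute once the dry run passes
    "h=db-master.example,D=commonswiki,t=recentchanges",
]
subprocess.run(cmd, check=True)
```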
[00:48:18] Logged the message, Master [00:50:30] PROBLEM - Host db51 is DOWN: PING CRITICAL - Packet loss = 100% [00:51:33] RECOVERY - Host db51 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [00:54:01] New patchset: Hashar; "make 10.0.5.8:8420 a global variable" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7702 [00:54:58] !log preparing to rotate s4 master from db31 to db51 [00:55:01] Logged the message, Master [00:55:44] hashar: global $10.0.5.8:8420; looks weird [00:56:18] let me rephrase it [00:56:41] converts hardcoded 10.0.5.8:8420 to new $wmgUdp2logDest [00:56:47] Reedy: would it better this way? [00:57:02] heh [00:57:12] (wants a citrus can?) [00:57:22] I'll be mightily impressed [00:57:37] I can definitely find someone to bring one to you :-]]] [00:58:11] New patchset: Hashar; "convert hardcoded 10.0.5.8:8420 to $wmgUdp2logDest" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7702 [00:58:15] better tittle [00:58:34] !log new s4 master position - MASTER_LOG_FILE='db51-bin.000114', MASTER_LOG_POS=1772578 [00:58:37] Logged the message, Master [00:59:58] PROBLEM - MySQL Replication Heartbeat on db1004 is CRITICAL: CRIT replication delay 215 seconds [01:00:01] !log shutting down mysql on db31, then rebooting [01:00:04] Logged the message, Master [01:00:15] PROBLEM - MySQL Replication Heartbeat on db22 is CRITICAL: CRIT replication delay 230 seconds [01:00:42] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 259 seconds [01:01:00] PROBLEM - MySQL Replication Heartbeat on db1038 is CRITICAL: CRIT replication delay 277 seconds [01:01:09] PROBLEM - MySQL Replication Heartbeat on db51 is CRITICAL: CRIT replication delay 284 seconds [01:01:29] ^^ will clear when s4-master dns change propagates, not an issue [01:02:19] New patchset: Asher; "s4 master -> db51" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7703 [01:02:38] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7703 [01:03:39] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7703 [01:03:41] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7703 [01:06:42] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay seconds [01:06:42] RECOVERY - MySQL Replication Heartbeat on db1038 is OK: OK replication delay seconds [01:07:09] RECOVERY - MySQL Replication Heartbeat on db1004 is OK: OK replication delay seconds [01:07:18] RECOVERY - MySQL Replication Heartbeat on db22 is OK: OK replication delay seconds [01:15:54] RECOVERY - MySQL Replication Heartbeat on db51 is OK: OK replication delay 0 seconds [01:41:15] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 225 seconds [01:44:06] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [01:46:19] New patchset: Hashar; "implements beta labs specific domains" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7705 [01:46:20] New patchset: Hashar; "override $cluster when on labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7706 [02:15:18] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [02:18:30] * jeremyb scrolls up [02:45:50] Reedy: ping [02:45:55] Ohai [02:46:04] hey, thanks for the merges ;) [02:46:04] Yes, I ran namespaceDupes for 3 wikis [02:46:18] how do you know what i'm going to ask? ;) [02:46:32] i thought only one needed it but can't hurt ;) [02:46:59] Well, 3 had fixable changes [02:47:00] i didn't see it !log'd so i was wondering. (but i did see stuff in the new namespace) [02:47:03] so I guess it was good I should ;) [02:47:29] huh [02:47:40] i guess something can always slip in [02:47:52] but one had widespread + active use i think [02:48:04] they just didn't realize it wasn't really an NS [02:48:35] whatchya think about 7574? [02:48:37] People do some weird stuff [02:50:22] I've not tested it... [02:50:31] But for consensus I could just blame DannyB [02:54:22] Links work at least [02:55:12] bugzilla just gave me i'm just wrapping my brion around all of this [02:55:23] Indeed [02:55:30] We have some amusing quotes [02:55:41] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7574 [03:00:53] Seriously, how did that path conflict [03:03:10] New patchset: Reedy; "Bug 36813 - update wgUploadNavigationUrl on all cs wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7574 [03:04:08] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7574 [03:04:10] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7574 [03:13:36] idk [03:13:49] but i was trying to fix it... [03:13:51] ! [remote rejected] HEAD -> refs/for/master (change 7574 closed) [03:13:57] * jeremyb is too slow! ;-P [03:13:59] Whitespace apparently [03:14:06] yeah, but still idk [03:14:14] there are no lines changed in both i think [03:14:36] the only difference i could see was the space between 'cswiki' and => [03:15:26] i wonder if bug 33919 needs doing for 2013? 
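(The s4 rotation above promotes db51 and publishes the new coordinates, MASTER_LOG_FILE='db51-bin.000114' and MASTER_LOG_POS=1772578; each replica then has to be repointed at them. A minimal sketch of one such repoint, assuming pymysql and placeholder replica host and credentials; the real rotation is driven by the s4-master DNS change and the usual tooling, not this script.)

```python
import pymysql

# Placeholder replica host and credentials; the replication user/password
# already configured on the replica are preserved by CHANGE MASTER.
conn = pymysql.connect(host="db22.example", user="repl_admin",
                       password="secret", autocommit=True)
with conn.cursor() as cur:
    cur.execute("STOP SLAVE")
    cur.execute(
        "CHANGE MASTER TO MASTER_HOST=%s, "
        "MASTER_LOG_FILE=%s, MASTER_LOG_POS=%s",
        ("db51", "db51-bin.000114", 1772578),
    )
    cur.execute("START SLAVE")
```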
[03:15:40] (make a search index for it) [03:16:13] No [03:16:20] notpeter gave me a list of all current indexes [03:16:30] i gave him a list of all wikis not on that list (non private) [03:16:43] and he created the indexes [03:19:08] how long was it? [03:19:20] 7 or 8 wikis [03:19:44] Reedy: would you be able to review my changes in operations/mediawiki-config ? [03:19:53] although it is probably late right now [03:20:04] and you are busy ;-) [03:24:58] New patchset: Hashar; "implements beta labs specific domains" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7705 [03:24:58] New patchset: Hashar; "override $cluster when on labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7706 [03:24:59] New patchset: Hashar; "convert hardcoded 10.0.5.8:8420 to $wmgUdp2logDest" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7702 [03:25:11] rebased them to be sure [03:28:36] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/6905 [04:19:17] what is news.dblist ? and "p is a symlink to "php" [04:19:20] gah [04:19:31] what is news.dblist ? and "p" is a symlink to "php" why? for convenience? [04:31:15] date has passed: https://gerrit.wikimedia.org/r/gitweb?p=operations/mediawiki-config.git;a=blob;f=wmf-config/CommonSettings.php;hb=e36b459faa1cbee44d205b0cf2439e6a7fb0b0aa#l1827 [04:31:20] should be removed? [04:46:45] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [05:00:51] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [05:07:19] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [05:31:37] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [06:20:30] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [06:29:03] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:17:58] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:37:55] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:42:11] anyone understand how codurr is configured? [07:42:50] i see bits of it in the puppet repo but it's commented out in favor of CIA [07:43:01] and i thought CIA wasn't in use... 
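(The search-index check near the start of this stretch, a "list of all current indexes" diffed against "all wikis not on that list, non private", is a plain set difference over dblists. A sketch under the assumption that the inputs are one-wiki-per-line text files; the actual file names exchanged by notpeter and jeremyb are not in the log, so the ones below are hypothetical.)

```python
# Hypothetical file names; the real lists exchanged are not named in the log.
def read_list(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

all_wikis = read_list("all.dblist")
private = read_list("private.dblist")
indexed = read_list("current-search-indexes.txt")

missing = sorted((all_wikis - private) - indexed)
print("\n".join(missing))  # wikis that still need a search index
```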
[08:17:10] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:32:37] PROBLEM - Apache HTTP on srv297 is CRITICAL: Connection refused [08:35:37] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:37:31] PROBLEM - Apache HTTP on srv296 is CRITICAL: Connection refused [08:39:28] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:42:55] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.036 second response time [08:46:40] PROBLEM - Host srv296 is DOWN: PING CRITICAL - Packet loss = 100% [08:48:28] RECOVERY - Host srv296 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [08:54:46] RECOVERY - Apache HTTP on srv296 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.141 second response time [09:03:19] PROBLEM - Host srv295 is DOWN: PING CRITICAL - Packet loss = 100% [09:04:58] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:05:43] RECOVERY - Host srv295 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [09:08:43] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [09:09:28] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [09:10:41] eh, srv278, just rebooted but i didnt touch it and did not get upgrades.? any tests? [09:13:40] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [09:14:08] mutante: I vaguealy recollect it rebooting a lot lately [09:15:19] ah [09:15:20] mutante: https://rt.wikimedia.org/Ticket/Display.html?id=24 [09:15:28] I reopened that 2 weeks ago [09:15:40] aha:) thanks [09:15:50] what should we do? [09:16:00] ping rob? [09:16:26] let rob decide if it should reported to Dell or decommissioned and then forward to Chris [09:16:56] okay [09:17:12] New patchset: Hashar; "convert hardcoded 10.0.5.8:8420 to $wmfUdp2logDest" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7702 [09:18:12] New review: Hashar; "(no comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7702 [09:18:49] New review: Hashar; "wmf prefix make more sense. Also we have to fill the value BEFORE including InitialiseSettings.php" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/7702 [09:20:09] New patchset: Hashar; "implements beta labs specific domains" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7705 [09:20:43] New review: Hashar; "Patchset 3 is a rebase on latest merged change." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/7705 [09:21:28] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [09:21:46] paravoid: [ 32.148486] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMC (20090903/power_meter-772) [09:22:03] ACPI Error: SMBus or IPMI write requires Buffer of length 42, found length 20 ..shrug [09:22:35] New patchset: Hashar; "override $cluster when on labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7706 [09:22:53] New review: Hashar; "Patchset 3 is a rebase on latest merged change." 
[operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/7706 [09:24:33] !log srv278 - still has issues as in reopnened RT #24 - upgrading kernel anyways [09:24:38] Logged the message, Master [09:24:55] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [09:26:43] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:31:04] PROBLEM - Host srv294 is DOWN: PING CRITICAL - Packet loss = 100% [09:33:27] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:33:45] RECOVERY - Host srv294 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [09:34:03] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [09:36:45] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [09:37:12] PROBLEM - Apache HTTP on srv294 is CRITICAL: Connection refused [09:37:12] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:40:03] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [09:42:54] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:46:21] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:48:18] PROBLEM - Host srv293 is DOWN: PING CRITICAL - Packet loss = 100% [09:48:36] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [09:49:57] RECOVERY - Host srv293 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [09:50:42] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:53:24] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [10:06:09] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:17:35] New review: Dzahn; "testing if i have +2 now / this is not deployed anywhere yet." 
[operations/apache-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7024 [10:17:36] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/7024 [10:23:06] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:25:05] New patchset: Hashar; "adding .gitreview" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/7717 [10:26:29] !change 7025 | hashar [10:26:29] hashar: https://gerrit.wikimedia.org/r/7025 [10:27:08] mutante: yeah going to rebase 7025 [10:27:13] cool [10:27:16] New review: Hashar; "(no comment)" [operations/apache-config] (master) C: 1; - https://gerrit.wikimedia.org/r/7717 [10:28:12] New review: Dzahn; "(no comment)" [operations/apache-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7717 [10:28:14] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/7717 [10:29:15] New patchset: Hashar; "ignore some well known scheme and other specific files" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/7025 [10:30:04] New review: Dzahn; "(no comment)" [operations/apache-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7025 [10:30:06] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/7025 [10:30:19] easy stuff [10:30:46] next step would be to migrate the files in fenari to use that repo [10:31:21] yes, replace svn with git clone from this one on fenari as intermediate step before the day it is all puppet deployed was my understanding so far [10:31:49] I don't think it will be puppetized [10:31:51] but that moment needs to be announced and a big warning added to not keep using the old way [10:32:15] by my understanding, we want all to apply the apache conf changes all at the sametime [10:33:45] food time [10:34:09] ok, so what we need next is a script/alias that pulls them from this and then syncs [10:34:14] yep, food good idea [10:34:47] then I will do some accounting ... [10:34:50] I hate accounting [10:35:30] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:35:31] needs do book tickets etc for travel [10:37:37] Laura can definitely help you there :-D [10:37:52] !log on db39 dropped triggers pt_osc_elwiki_recentchanges ins, del, upd, they were preventing all elwiki edits except bot edits with the complaint Table 'elwiki._recentchanges_new' doesn't exist ... binasher, doublecheck me please? [10:37:56] Logged the message, Master [10:37:58] though inside Germany, you will probably fine :-] [10:40:41] I'm pretty sure something weird happened hen when the commonswiki recentchanges schema change went around but he's the only one who will know [10:41:57] PROBLEM - Apache HTTP on srv292 is CRITICAL: Connection refused [10:42:42] the other dbs on s3 did not have triggers (I did a quick check for *TRN, *TRG [10:43:18] RECOVERY - Apache HTTP on srv292 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.167 second response time [10:56:03] PROBLEM - Host srv291 is DOWN: PING CRITICAL - Packet loss = 100% [10:58:00] RECOVERY - Host srv291 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [11:22:46] does anyone around here know about our backups strategy? [11:22:55] around here and around this timezone :) [11:23:35] backups of what in particular? 
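(The db39 entry above removes pt_osc_* triggers left over from an online schema change; they were blocking elwiki edits because their target table no longer existed. A hedged sketch of finding and dropping such leftovers, assuming pymysql and placeholder host and credentials; the trigger naming follows the pt_osc_<db>_<table>_<ins|upd|del> pattern quoted in the log, but the exact commands run on db39 are not recorded.)

```python
import pymysql

# Placeholder host and credentials.
conn = pymysql.connect(host="db39.example", user="root", password="secret",
                       autocommit=True)
with conn.cursor() as cur:
    # pt-online-schema-change names its triggers pt_osc_<db>_<table>_<ins|upd|del>.
    cur.execute(
        "SELECT trigger_schema, trigger_name "
        "FROM information_schema.triggers WHERE trigger_name LIKE %s",
        ("pt_osc_%",),
    )
    for schema, name in cur.fetchall():
        print("dropping trigger %s.%s" % (schema, name))
        cur.execute("DROP TRIGGER IF EXISTS `%s`.`%s`" % (schema, name))
```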
[11:23:48] I've sent a mail a few days ago [11:23:52] about nfs1/2 [11:23:57] ah [11:24:05] that I couldn't find people's ~ to be backed up [11:24:08] I thought /home went on tridge, no? [11:24:10] lemme look [11:24:17] I have no idea, hence asking :) [11:24:23] I'll check [11:25:26] thanks but please give me some info on what you're checking so I won't ask next time :) [11:26:16] I will [11:26:34] :-) [11:35:38] /data/db20/home [11:35:41] now, how I found it. [11:35:58] grep for tridge in puppet repo. [11:36:02] see this suspicious line: [11:36:12] command => '[ -d /home/wikipedia ] && rsync --rsh="ssh -c blowfish-cbc -i /root/.ssh/home-rsync" -azu /home/ [11:36:12] * db20@tridge.wikimedia.org:~/home/' [11:36:22] walk back to see what class it is, where it's included, blah blah [11:36:35] finally realize I should just look on tridge and not on the nfs hosts :-P [11:37:10] fwiw it is listed in the backup status page [11:37:15] http://wikitech.wikimedia.org/view/Backup_status_chart [11:37:32] not that it says where they go exactly but that's not what that page is for really [11:39:43] paravoid: [11:58:01] hello [11:59:16] apergos: oh, thanks [11:59:21] sure [11:59:30] but why are we using rsync for that instead of amanda? [11:59:41] I don't know details [11:59:45] when I checked, I checked /data/amanda [11:59:59] okay then [12:00:14] there is a bunch of stuff that gets rsynced. don't know who set them up nor what the rationale is [12:14:21] PROBLEM - Apache HTTP on srv290 is CRITICAL: Connection refused [12:16:36] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [12:20:50] !log deployment-prep replaced most occurrences of /mnt/upload to /mnt/upload6 [12:20:54] Logged the message, Master [12:22:40] hashar: wrong channel [12:22:48] arf [12:22:53] :) [12:22:54] haha [12:23:33] I guess I should get some sleep [12:23:39] waking up at 1am is no fun :-D [12:23:57] 1am I usually go sleep [12:24:17] I woke up at 1am :-( Took an half hour nap at 6am [12:24:21] that give us ability to be only 24/7 [12:24:22] jet lag sucks [12:24:27] * online [12:24:29] not only [12:24:33] PROBLEM - Host srv290 is DOWN: PING CRITICAL - Packet loss = 100% [12:24:35] I should probably sleep as well [12:25:16] see you later everyone [12:26:30] RECOVERY - Host srv290 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [12:26:39] PROBLEM - NTP on srv290 is CRITICAL: NTP CRITICAL: Offset unknown [12:29:30] RECOVERY - NTP on srv290 is OK: NTP OK: Offset 0.0323921442 secs [12:31:23] !gerrit-search | paravoid [12:31:31] arr,, no.. what was it [12:31:44] @search gerrit [12:31:44] Results (found 2): gerrit, change, [12:34:59] paravoid: anyways, just meant to show the example to search for text in commit message cause they use "subject" as column name, but subject doesnt work, message does: status:open project:operations/puppet message:fix [12:36:47] :) [12:36:55] !gerrit-search is https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+message:$1 [12:36:55] Key was added! 
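(The rsync exec quoted above is how /home ends up under /data/db20/home on tridge. As a standalone illustration of the same guard-then-push pattern, using the paths and key file from the quoted puppet command; this is a sketch of the command's structure, not the puppet resource itself.)

```python
import os
import subprocess

# Same structure as the puppet exec quoted above: only the host that actually
# carries /home/wikipedia pushes an incremental copy of /home to tridge.
if os.path.isdir("/home/wikipedia"):
    subprocess.run(
        ["rsync",
         "--rsh=ssh -c blowfish-cbc -i /root/.ssh/home-rsync",
         "-azu", "/home/", "db20@tridge.wikimedia.org:~/home/"],
        check=True,
    )
```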
[12:37:12] (well that is just for open changes, but if i drop that part it reallly slows down ) [12:39:33] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:42:24] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:42:59] !gerrit-search del [12:42:59] Successfully removed gerrit-search [12:43:08] !gerrit-search is https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+message:$1,n,z [12:43:09] Key was added! [12:47:21] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [12:58:45] elections. so there it is [13:02:44] new elections ? [13:02:58] couldn't form a govt? [13:05:32] nope [13:08:21] PROBLEM - Apache HTTP on srv255 is CRITICAL: Connection refused [13:21:40] New patchset: Demon; "(bug 36827) Make "bug" case-insentitive for linking" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7724 [13:21:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7724 [13:53:45] PROBLEM - Apache HTTP on srv288 is CRITICAL: Connection refused [13:56:18] PROBLEM - Host srv288 is DOWN: PING CRITICAL - Packet loss = 100% [13:57:48] RECOVERY - Host srv288 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [14:02:36] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:19:51] RECOVERY - Apache HTTP on srv288 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [14:22:15] PROBLEM - Host search22 is DOWN: PING CRITICAL - Packet loss = 100% [14:33:14] RECOVERY - Host search22 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [14:34:35] PROBLEM - Apache HTTP on srv280 is CRITICAL: Connection refused [14:35:06] jeff_green: sudo ...only works on ocg2 [14:35:19] orly [14:36:32] PROBLEM - SSH on search22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:39:23] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:40:44] RECOVERY - SSH on search22 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:44:47] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:45:42] fenari strikes again . . . 
[14:47:38] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:54:05] RECOVERY - Lucene disk space on search22 is OK: DISK OK [14:58:53] RECOVERY - Apache HTTP on srv280 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [15:03:23] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:12:15] RECOVERY - NTP on search22 is OK: NTP OK: Offset -0.007666945457 secs [15:17:38] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:18:44] New patchset: ArielGlenn; "multiple workers; split out job queue and msg display code" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/7726 [15:33:41] New patchset: Demon; "Allow sudoers on gerrit box(es) to stop|start|restart gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7727 [15:34:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7727 [15:34:43] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:41:46] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [16:25:52] PROBLEM - Host srv280 is DOWN: PING CRITICAL - Packet loss = 100% [16:27:04] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.30:11000 (Connection refused) [16:28:13] don't worry, its just srv280 and up again already [16:28:34] RECOVERY - Host srv280 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [16:28:43] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [16:28:43] PROBLEM - NTP on srv280 is CRITICAL: NTP CRITICAL: Offset unknown [16:31:34] RECOVERY - NTP on srv280 is OK: NTP OK: Offset 0.02140009403 secs [16:32:19] PROBLEM - Apache HTTP on srv280 is CRITICAL: Connection refused [16:34:55] !log updating dns for wiki-pedia.org [16:34:59] Logged the message, Master [16:35:19] RECOVERY - Apache HTTP on srv280 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.351 second response time [16:36:55] !log srv app servers max. uptime with older kernel down to ~120 days after another bunch of upgrades [16:36:58] Logged the message, Master [16:37:37] RobHalsell, .wiki-pedia.org should be changed too [16:38:07] Platonides: uh, you mean the apache stuff? cuz i just did the dns, still doing the rest. [16:38:22] or the name registrar nameservers, cuz gthats also in processing. [16:38:54] New patchset: Demon; "Re-attempting links for RT and CodeReview." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6005 [16:39:07] RobHalsell, I mean the IP pointed by the doman, not www. [16:39:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6005 [16:39:18] ie. http://wiki-pedia.org/ which points to 88.191.250.14 [16:40:10] a pity the tld can't be made a CNAME [16:40:19] it points to the same thing as wikipedia.org [16:40:29] if it shows something else for you, its just cached. [16:41:18] 15 07:42:11 < jeremyb> anyone understand how codurr is configured? 
[16:41:18] 15 07:42:50 < jeremyb> i see bits of it in the puppet repo but it's commented out in favor of CIA [16:41:21] 15 07:43:00 < jeremyb> and i thought CIA wasn't in use... [16:41:39] mutante: I just made some changes to apache files for the cluster, care to give them a second glance for me? [16:41:47] i saw you say somethign in channel last ;] [16:41:49] * Platonides looks how to make dig bypass caches [16:41:59] Platonides: +trace [16:42:01] Platonides: dig @ns0.wikimedia.org wiki-pedia.org [16:42:04] dig @nameserver lookup.fqdn [16:42:06] or that. [16:42:07] or maplebed's [16:42:28] hooray for an echo in IRC's man service. [16:42:29] i use the @ command since i like to query against all three of the WMF nameservers [16:42:47] maplebed: wanna look at an apache config file change? [16:42:51] sure! [16:42:53] where is it? [16:43:07] svn diff in /home/wikipedia/conf/httpd/ [16:43:11] yep, it ends at wikimedia.org. now [16:43:45] i added a rewrite for the wiki-pedia.org to wikipedia.org in redirects.conf [16:44:00] i am pretty sure its all good, but this file is what caused my major cluster outage years ago [16:44:01] RobHalsell: you're gonna hate me. [16:44:04] dig @ns0.wikimedia.org wiki-pedia.org isn't what I wanted, since the .org servers could not be pointing to wikimedia.org [16:44:06] RobHalsell: ok, where, but you are not using git yet, right?:) [16:44:07] maplebed: tab indentation is off [16:44:11] :D [16:44:13] ok, gotcha [16:44:20] maplebed: right? [16:44:26] i notice it in svn diff but didnt in vim [16:44:42] but its not actually off. [16:44:47] the rest of the file uses spaces, you put in a tab. [16:44:51] ahh [16:44:54] well god damn it [16:44:55] lemme fix [16:45:20] Platonides: I'm confused by your comment. [16:45:24] maplebed: ok, check again ;] [16:45:36] mutante: maplebed is checkin, no worries [16:45:46] maplebed, you said Platonides: dig @ns0.wikimedia.org wiki-pedia.org [16:45:46] 18:41:45 dig @nameserver lookup.fqdn [16:46:22] but dig @ns0.wikimedia.org wiki-pedia.org doesn't help if .org root servers pointed to a different dns server [16:46:32] sure it does. [16:46:46] you're specifically asking what ns0 thinks wiki-pedia.org points to. [16:46:55] if it doesnt know, it goes to its master [16:47:04] that's what will be right eventually (once the NS servers are pointing to the right place) [16:47:09] it just happens to be the authoritative nameserver in this case [16:47:21] RobHalsell: or it refuses to recurse [16:47:22] so long as ns0 thinks it's authoritative for the domain, it'll be what's right eventually. [16:47:39] well, as long as that, and the whois shows it as that ;] [16:47:39] I see where you're going [16:47:52] which until 15 minutes ago, it didnt [16:48:04] i still say +trace ;) [16:48:05] maplebed: so my changes look ok? [16:48:18] jeremyb: different method for similar data [16:48:19] RobHalsell: in the rewrite rule, I see the previou stanza terminates the host match (with a $) but yours and the next one don't. Do you know why the match should or should not terminate the host string? [16:48:31] RobHalsell: ok, just fyi. 
we have a new git repo for apache config now, it is not in use yet, but we did merge files, soon the plan is to change the svn command with cloning from this repo, or at least that is what we are expecting [16:48:46] mutante: oh =/ [16:48:54] RobHalsell: so at some point we may want those changes there now as well [16:49:10] maplebed, probably for a host like wiki-pedia.org.en.wiktionary.org [16:49:12] maplebed: hrmm, i think i should add the termination [16:49:13] jeremyb: +trace is better for showing what's authoritative right now. @ns0 is better if the root nameservers haven't gotten the NS record update yet but we want to see what it'll be once that happens. :D [16:49:39] but it's true, both are very useful. [16:49:45] maplebed: but if it has propagated already then +trace is best [16:50:21] maplebed: yea, i added the wikipaedia stanza as well [16:50:28] so i neglected the $ on it as well [16:50:29] =P [16:50:31] ah. [16:50:35] that makes sense. [16:50:41] hence why the two outliers were mine [16:50:46] look decent now? [16:51:43] RobHalsell: don't worry, right before anything is changed on fenari we will have to merge from svn again anyways, but if you feel like it, check in there as well after you are satisfied with the result being live [16:52:42] I hadn't seen the %1/$1 thing before. that's useful. [16:53:14] RobHalsell: +1 [16:53:20] cool [16:54:51] !log updated apache config for wiki-pedia.org, seems the bot doesnt spam that anymore =[ [16:54:54] Logged the message, Master [16:55:11] I would expect it to spam when you sync it out. [16:55:20] yet it didnt [16:55:30] usually i expect it when i gracefull all apaches [16:55:42] used to say 'so and such gracefulled all apaches' or whatever [16:55:44] hm.. IIRC there are a few different ways to sync, each of which is a wrapper to the next, one of which spams. [16:59:05] time for a taco truck lunch. [16:59:46] huh. [16:59:56] we own wikispecies.{com,net} but not org. [17:00:21] yea its one of the top levels we dont really have [17:00:24] yeah, that bot is broken since a while [17:00:25] there arent a lot of them [17:00:45] ok, back in a bit, going to walk down the block to the taco truck. YAY TACOS! \o/ [17:00:47] I'm surprised. [17:00:52] yay tacos! [17:00:59] yay for wikispecies [17:01:06] \o/ TACOS \o/ TACOS \o/ TACOS \o/ TACOS \o/ TACOS \o/ TACOS \o/ TACOS \O/ [17:01:10] RobHalsell: not tacocopter? [17:01:25] not in DC, or i would totally do it. [17:01:59] oh, yeah. DC [17:02:16] plus http://www.wired.com/gadgetlab/2012/03/qa-with-tacocopter/ [17:02:47] https://twitter.com/TCampDC/statuses/196630521973977089 [17:19:45] New review: Dzahn; "(no comment)" [operations/debs/wikistats] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7573 [17:19:47] Change merged: Dzahn; [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/7573 [17:32:54] New patchset: Aaron Schulz; "Use new thumb purge hook for testwikis." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7731 [17:33:31] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7731 [17:33:38] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7731 [17:33:40] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7731 [17:54:02] New patchset: Pyoungmeister; "adding ganglia collector to carbon in site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7735 [17:54:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7735 [17:54:37] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7735 [17:54:39] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7735 [18:03:07] PROBLEM - Packetloss_Average on oxygen is CRITICAL: XML parse error [18:15:30] hello [18:15:36] nosy's looking into https://jira.toolserver.org/browse/TS-1381 which is "The user specified as a definer ('root'@'208.80.152.165') does not exist" on replication (this is commons, s4) [18:15:53] i'm thinking we should dump grants on both master and TS and compare [18:16:13] someone want to do that for her? [18:16:27] New patchset: Aaron Schulz; "Added swift-backend log." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7737 [18:16:35] jeremyb: possibly a good idea [18:16:45] the ip that tried to send is fenary [18:16:57] thats what the reverse dns lookup says [18:17:01] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7737 [18:17:04] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [18:17:07] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7737 [18:17:09] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7737 [18:18:34] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [18:23:29] maybe mutante, notpeter ? [18:30:38] notpeter: ping? [18:32:17] sup [18:33:24] * notpeter read scrollback [18:34:01] hhhmmmm, I'd ask binasher, as he's effectively our lead dba at this point. I don't feel comfortable making that call [18:35:12] notpeter: ok, thanks, ill ask him then [18:35:29] notpeter: you can't just dump the grants and make a diff? (i.e. not change the DB at all) [18:40:34] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [18:41:41] jeremyb: it's only GRANT ALL PRIVILEGES ON `%wik%`.* TO 'wikiadmin'@'208.80.152.%' [18:42:17] Reedy: surely that's not all. you just can't see them all because you're not sufficiently privileged? [18:42:39] oooh, it's a binasher [18:42:41] binasher: have a sec? [18:43:06] what's up? 
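(The "dump grants on both master and TS and compare" idea above is cheap to script. A minimal sketch, assuming pymysql, placeholder hostnames and an account that can read mysql.user on both sides; for TS-1381 the interesting difference is whether 'root'@'208.80.152.165' (fenari) exists as an account at all, since, as binasher explains just below, triggers replicate with the DEFINER that created them.)

```python
import pymysql

def accounts(host):
    """Set of (user, host) accounts defined on a server (placeholder credentials)."""
    conn = pymysql.connect(host=host, user="admin", password="secret")
    with conn.cursor() as cur:
        cur.execute("SELECT user, host FROM mysql.user ORDER BY user, host")
        return set(cur.fetchall())

master = accounts("s4-master.example")
replica = accounts("toolserver-s4.example")
print("only on master: ", sorted(master - replica))
print("only on replica:", sorted(replica - master))
```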
[18:43:11] hmm, there's a root/repl in the process list [18:43:20] TS replication for s4 [18:43:25] 15 18:15:36 < jeremyb> nosy's looking into https://jira.toolserver.org/browse/TS-1381 which is "The user specified as a definer ('root'@'208.80.152.165') does not exist" on replication (this is commons, s4) [18:43:29] yeah, i emailed nosy a couple hours ago [18:43:29] 15 18:15:52 < jeremyb> i'm thinking we should dump grants on both master and TS and compare [18:43:51] huh, she was just here a few mins ago and didn't say anything about that. (but she did say she mailed you) [18:44:41] "this was a result of me performing an online schema change. More of these will happen on all wikis. Granting all privileges to root @ fenari in all dbs should be safe, let you resume s4, and prevent further issues. Since that ip can't hit your db, it shouldn't be a security issue. " … "Triggers replicate a bit differently than normal queries and need to run as the user that created them instead of as the normal replication user [18:44:46] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [18:45:08] New patchset: Pyoungmeister; "adding if statement for precise to install oracle java instead of sun java" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7739 [18:45:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7739 [18:45:28] ohhh, i knew triggers were special. i just couldn't imagine why they would be used [18:45:38] OSC makes sense [18:46:12] also from my email to nosy "The osc creates a new table with the new schema, copies from the old to the new in small chunks, and creates triggers on the old table so that changes are propagated to the new table during the copy. Then it switches them and removes the original when done. That way you can run alters on a master under certain conditions without locking." [18:46:33] Fancy [18:47:11] binasher: do you get all of that (triggers and copying) propagated to all slaves? i guess so or this wouldn't be an issue [18:47:51] binasher: there must be some limit on what kind of queries it works with though? must be idempotent? [18:48:13] UPDATE foo SET bar=bar+1; [18:48:52] New patchset: Pyoungmeister; "adding if statement for precise to install oracle java instead of sun java" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7739 [18:49:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7739 [18:50:16] jeremyb: that's query is fine, handled with an "after update" trigger that calls a replace into on the temp table with the new value of the changed row [18:50:51] ahh [18:51:36] there are limitations on the type of alter you can run though, and a requirement that the existing table have at least one unique key of some sort [18:52:35] and things get more complicated if you have real foreign key relationships [18:52:37] does it detect unsupported alters and stop you? 
[18:52:42] or you just have to know [18:57:48] it will at least stop you by failing without destroying any data :) but may leave you needing to clean things up via dropping the target table or triggers [18:58:55] but not fail fast i guess [19:00:04] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:02:24] it does make some checks for its requirements, and does a "create table like" and runs the actual alter on the target table first and fails quickly there, but there's definitely room to get into trouble after those steps [19:17:01] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is -0.0876528099174 [19:21:22] PROBLEM - Packetloss_Average on oxygen is CRITICAL: XML parse error [19:21:41] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:21:58] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [19:30:33] hiya [19:30:43] there is udp2log packet loss on oxygen right now [19:30:45] trying to find out why [19:31:03] the squid logging multicast relay is using 75% cpu [19:31:38] trying to understand how that works, and what it does [19:31:43] what machine is 233.58.59.1? [19:32:18] notpeter, any idea? [19:33:21] ottomata: that's a multicast address [19:33:23] it's a multicast address, so it's not just a machine [19:33:35] the relay should be on oxygen [19:33:45] aye right right [19:33:52] yeah the relay is there [19:34:33] is there anyone reading from the relay right now? [19:34:50] can I stop it for a minute? [19:34:54] what is using the relay? [19:35:26] any host that is a member of the group with the IP you listed [19:35:34] ottomata: the udp2log instance on oxygen is using it.... [19:35:41] somewhere the group members must be configured afaik [19:36:10] ottomata: the relay is unicast to multicast [19:36:18] if you stop the relay we'll stop getting logs from esams [19:36:28] ok [19:36:31] won't stop [19:39:02] New patchset: Aaron Schulz; "Disabled new hook due to loop bug." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7742 [19:39:04] so only oxygen's udp2log instance is using the multicast relay? [19:39:17] yes [19:39:26] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7742 [19:39:33] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7742 [19:39:35] is there a reason to use multicast then? [19:39:35] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7742 [19:39:36] so, if you stop it, all packet loss will stop [19:39:48] haha, right, because all packets will be lost? :p [19:40:08] the reason is to be able to scale grabbing the logs [19:40:12] s/lost/not sent/! [19:40:26] so [19:40:53] socat listens on oxygen on port 8419, and then multicasts to 233.58.59.1:8420 [19:40:59] which only oxygen is subscribing to [19:40:59] yes [19:41:02] yes [19:41:03] (right now) [19:41:06] yes [19:41:28] i could also just set up a 2nd udp2log instance on oxygen listening on port 8419 [19:41:33] and kill the socat? 
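(The relay under discussion is conceptually tiny: receive unicast UDP on oxygen port 8419, the stream from esams, and re-send every datagram to the multicast group 233.58.59.1:8420, so any number of consumers can join the group, which is notpeter's scaling argument. A minimal sketch of both halves, using the addresses from the conversation; in production the relay is socat and the consumer is udp2log, not this code.)

```python
import socket
import struct

GROUP, GROUP_PORT, UNICAST_PORT = "233.58.59.1", 8420, 8419

def relay():
    """Unicast-to-multicast relay: the role socat plays on oxygen."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Ask for a large receive buffer; the kernel caps this at net.core.rmem_max,
    # which is why the sysctl change discussed below matters for packet loss.
    rx.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 524288)
    rx.bind(("0.0.0.0", UNICAST_PORT))
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 2)
    while True:
        data, _ = rx.recvfrom(65535)
        tx.sendto(data, (GROUP, GROUP_PORT))

def subscriber():
    """Group member: what each udp2log instance does when it joins the group."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", GROUP_PORT))
    mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
    s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    while True:
        line, _ = s.recvfrom(65535)
        # hand the squid log line to whatever processes it
```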
[19:42:19] Poor cat [19:42:26] I mean, ideally, we'd be transitioning all of the udp2log instances to use the multicast relay [19:42:31] not go the other direction [19:43:11] Reedy: responsible for this (pointed out by jeremy) http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=613945 [19:43:12] notpeter: ha ha ha [19:43:17] cuz if, say, you ever want to spin up another udp2log instance, you're going to want that multicast relay [19:43:22] notpeter: I literally just said that to Ryan_Lane [19:44:51] right [19:45:00] but somethign is causing packet loss on oxygen now [19:45:05] maybe the relay should be elsewhere [19:45:16] not on the same machine that is also writing the logs [19:46:23] I mean, if you need more boxes, ask for them [19:46:41] but I think turning of the relay is not the direction to go [19:48:00] maybe not long term [19:48:04] but if there is packet loss right now [19:48:06] people get angry [19:49:40] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [20:06:17] notpeter: the puppet rule that puts 99-big-rmem.conf in place [20:06:27] needs to "sysctl -p /etc/sysctl.d/99-big-rmem.conf" after to make it go live without waiting for a reboot [20:06:31] * binasher manually did that [20:07:19] is it possible that we would see these alerts for packet loss from only one log machine? [20:07:35] most of the loss numbers are low, except for [20:07:35] sq86.wikimedia.org lost: (28.30189 +/- 16.44868)% [20:07:50] now i've restarted udp2log [20:07:50] a [20:07:51] n [20:07:55] and its running with --recv-queue=524288 [20:08:16] that margin of error is pretty huge [20:08:26] yeah true [20:09:05] socat has been dropping some packets though [20:09:53] it doesn't have an adjustable buffer size like udp2log, so would need to alter rmem_default and restart that [20:16:41] so, binasher, does/will that fix the packet loss problem? [20:16:55] and, are you saying that the sysctl setting needs to be permanently puppetized? [20:21:11] thought I ran that, but yeah, put it into puppet is the right call [20:34:14] New patchset: Ottomata; "Reloading sysctl when rmem_max file changes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7751 [20:34:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7751 [20:34:55] notpeter, binasher, how's that? [20:34:58] helpful? [20:44:39] does anyone know how to regenerate the wikiversions.cdb file ? [20:45:10] AaronSchulz: ---^^ [20:45:24] that part is not complicated enough [20:45:39] I think it should need some more level of file abstractions and several caches :-D [20:45:51] populateWikiversionCDB.php [20:45:53] sounds good [20:48:56] the beta labs cluster is fully screwed hehe [21:01:10] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:01:43] hashar: sync-wikiversions regenerates it and pushes [21:01:46] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:01:54] Reedy: yeah should [21:01:59] RoanKattouw: Have you seen/read https://bugzilla.wikimedia.org/show_bug.cgi?id=36839 ? 
If not, ahve a look at comment 10 [21:02:25] Reedy: trying to do it on labs, it ends up /h/w/c does not have 1.20wmf2 / 1.20wmf3 [21:02:40] Reedy: we will need to make them submodules one day :-) [21:02:44] heh [21:02:56] I am trying to get foreachwiki to work :-D [21:03:07] Reedy: OOOOOOOH [21:03:14] Dude I was totally puzzled as to why this was happening [21:03:47] OK [21:04:01] IIRC you used 1000 as an overly conservative value? [21:04:19] I guess so [21:04:37] We could set it to 10k maybe [21:04:48] But still, this bug is going to happen [21:04:55] I wonder why this particular regex recurses so deeply [21:05:32] Oh [21:05:41] It's (onething|twothings|threethings)+ [21:08:49] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:11:35] RoanKattouw: maybe it doesn't *actually* recurse, comment 10 sounds like if you have a greedy regex you'll exhaust the "recursion" limit simply by matching beyond what the regex was looking for. not sure if it's true in PHP, but it's the difference between ".+" and ".+?" in Ruby. [21:12:43] wikiversions.cdb successfully built. [21:12:45] oh yeah! [21:12:47] (on labs) [21:12:51] have a good night [21:13:15] thanks again hashar [21:13:26] as I said [21:13:31] there is a ton of prerequisites :-D [21:13:42] at least we now have 1.20wmf branches and multiversions [21:13:49] need to do some DB upgrades [21:13:53] \O/ [21:13:57] of master [21:14:15] will be for tomorrow, for now, I am head bed :-] [21:14:25] will investigate the extension issue tomorrow [21:14:29] bye!! [21:16:16] chrismcmahon: Aha, using +? works [21:16:26] Well, sort of [21:17:02] It starts failing at 1023 chars instead [21:17:31] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:18:34] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:21:04] RoanKattouw: I don't know much CS, but I've been wrestling regexes a long time. testers tend to do that. [21:21:43] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:38:47] @replag [21:38:49] Krinkle: No replag currently. See also "replag all". [21:39:58] hi guyyyys, gonna be noisy for a minute about some gerrit changes I have that are waiting for review [21:40:09] notpeter: these two are for you: [21:40:10] https://gerrit.wikimedia.org/r/#/c/6798/ [21:40:14] https://gerrit.wikimedia.org/r/#/c/7751/ [21:40:28] @replag [21:40:29] Krinkle: [s1] db12: 12s; [s5] db44: 1s [21:43:37] and this one, hmmm [21:43:39] maybe maplebed? [21:43:39] https://gerrit.wikimedia.org/r/#/c/7285/ [21:44:00] hmmm... [21:44:18] :D [21:44:36] New patchset: Bhartshorne; "adding in configuration option to only write thumbs to some containers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7753 [21:44:38] we can trade! [21:44:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7753 [21:47:37] haa, ok! [21:49:31] oo, Q [21:49:33] is it better to do [21:49:35] scope.lookupvar("swift::proxy::config::write_thumbs") [21:49:36] than just [21:49:40] $swift::proxy::config::write_thumbs [21:49:44] ohhh, or is that not valid ruby? [21:49:46] with the :: [21:49:46] ? [21:49:52] I think maybe it's not valid. 
[21:49:55] aye [21:50:01] the scope.lookup thing is what ryan told me to do. [21:50:08] ok, just curious [21:50:20] i've used that before, but only when I was trying to do fancier things with the variable if it wasn't set [21:50:25] apparently the way puppet looks up scope is changing across some version around now [21:50:53] but I don't really understand the details. [21:51:28] how is the statistics::db different from generic::mysql? [21:51:53] oh. [21:51:56] I don't think you want the mysql::stuff. [21:52:06] I think that's specific to cluster dbs. [21:52:19] binasher: do you remember which dbs are allowed to use mysql:: stuff and which aren't? [21:52:19] New review: Ottomata; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/7753 [21:52:21] (puppet) [21:52:39] um, i think i looked at it, most of it was pretty generic [21:52:44] lemme double check [21:53:06] most, yes, but there's some weirdness about the percona stuff and cluster membership. [21:53:26] hmmmmm [21:53:34] well, i manually included the packages I wanted [21:53:40] include mysql::packages, [21:53:40] 125 » » mysql::mysqluser, [21:53:41] 126 » » mysql::mysqlpath, [21:53:41] 127 » » mysql::datadirs [21:53:56] packages just installs from debs [21:54:11] is there a problem with those packages for a stand alone mysql server? [21:54:28] maybe not... asher's done some work here since I last tried. [21:54:33] I'm reading more carefully now. [21:55:16] yeah, you're right; it looks like everything you've included is fine. [21:56:14] ok cool [21:56:24] do you want a +1 or +2+merge? [21:56:38] this is just for fabian, he wanted to be able to load some data into a local db so he could run some simple queries [21:56:40] +2 please :) [21:56:45] my mysql packages won't give you a my.cnf [21:56:48] I'd +2 you but I don't have the powers [21:56:49] that's fine [21:56:54] i'll tweak and then puppetize? [21:56:54] i wouldn't really recommend it for a one off [21:56:56] zat ok? [21:57:47] binasher: what would you suggest? [21:58:09] i think we had this conversation.. a generic mysql class that just installs the distro stuff [21:58:20] oh, rather than the wikimedia debs? [21:58:25] yes [21:58:36] i can do that [21:58:41] so there's generic::mysql::server [21:58:54] just saying, there are mariadb classes in labs that just install a standalone server, but still dunno about the mariadb vs. mysql, (i like not having to Oracle message on startup) [21:58:55] but it *just* pulls in the mysql server package, not the lvm stuff or the percona tools or maatkit or ... [21:58:59] * mutante hides again [21:59:13] oh! [21:59:17] how'd I miss those [21:59:34] probably because you though "Hm, I want mysql" and so went straight to mysql.pp. [21:59:37] geez ok, maplebed, don't approve for now, i will fix, and then poke again later [21:59:38] like a sane person. [21:59:41] haha [22:00:01] /kick maplebed [22:00:06] ;) [22:00:45] (I'm pretty sure I put in the generic::mysql::server class.) [22:01:19] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:04:17] New patchset: Bhartshorne; "adding in configuration option to only write thumbs to some containers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7753 [22:04:37] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7753 [22:05:18] New review: Ottomata; "Going to use generic::mysql::* classes instead." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/7285 [22:09:08] New patchset: Bhartshorne; "adding in configuration option to only write thumbs to some containers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7753 [22:09:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7753 [22:12:52] New patchset: Bhartshorne; "adding in configuration option to only write thumbs to some containers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7753 [22:13:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7753 [22:15:29] New patchset: Bhartshorne; "adding in configuration option to only write thumbs to some containers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7753 [22:15:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7753 [22:17:40] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [22:19:46] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:21:18] New review: Aaron Schulz; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/7753 [22:30:25] !/ [22:30:41] !log snowolf is gay and autistic [22:30:44] Logged the message, Master [22:31:15] RobH: ^ [22:31:39] lmao how does it feel to be powerless snowolf [22:32:19] .... [22:32:28] bah, now i have to look up the ban command [22:32:31] this client sucks [22:32:37] Gotta love auto rejoin [22:32:37] RobH: op me [22:32:43] I'll ban :) [22:32:45] /mode #wikimedia-operations +b angryblackman [22:32:51] Damianz: no doesnt work like that [22:33:04] Snowolf: just put the command in channel without the / to start [22:33:08] Don't do it Rob, he'll take over. He's a power hungry motherfucker [22:33:13] /mode #wikimedia-operations +b *!*@mobile-198-228-225-042.mycingular.net [22:33:27] and then kick again [22:33:35] yay [22:33:51] :) [22:34:04] i thought it was the same mask as the invite exemption stuffs, but wasnt sure and limechat doesnt have ban in the kick options [22:34:07] Well my client is nice and expands my *'s for me :P [22:34:28] i have yet to find awesome graphical os x irc client [22:35:41] Snowolf: so yea i can pull that bit of cruft off the wikitech admin log [22:35:52] but the tweet and such is already out there and such [22:35:57] figured you prolly dont care [22:36:11] we get random log bot trolling once every few months. [22:36:42] It's more insulting that he thought gay or autistic was insulting :P [22:36:57] RobH: heh, I still would appreciate if you could remove it from the log anyway [22:37:01] if it happened more often we would put in some kind of whitelist for logging, but it really doesnt [22:37:10] but yea, i am pulling up wikitech to pull from there [22:37:30] Thanks [22:41:55] yea its off wikitech, twitter, and identica now [22:41:59] so all good. [22:46:35] RobH: much appreciated, sorry for the bother [22:50:30] RobH: linkinus didn't pass your tests? [22:51:34] I thought we had resolved the thing that was annoying you. 
[22:55:00] its great, but not open source and free is all [22:55:10] so doing my best with other options before i cave ;] [22:57:30] you won't be tainted by using a piece of commercial software. [22:57:36] (much) [22:57:38] RobH: MacIrssi :-) [23:13:26] Hey guys, I've been getting lots of API errors from various tools for the past half hour. The error is always "ERR_SOCKET_FAILURE, errno (98) Address already in use". [23:13:57] mutante mentioned that some of the API servers had to be restarted, but it seems strange they are still giving errors [23:18:36] Beuller? [23:18:57] Doing any specific requests? [23:19:12] all sort of things.... [23:19:39] edit's via Twinkle, requesting data via PageTriage, et.c [23:20:36] I suspect it's not giving server names? [23:21:00] Request: POST http://en.wikipedia.org/w/api.php, from 10.64.0.136 via cp1004.eqiad.wmnet (squid/2.7.STABLE9) [23:21:07] Request: POST http://en.wikipedia.org/w/api.php, from 10.64.0.136 via cp1004.eqiad.wmnet (squid/2.7.STABLE9) [23:21:18] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:21:28] those are the only 2 I saved [23:22:23] here's another one: [23:22:23] Request: POST http://en.wikipedia.org/w/api.php, from 10.64.0.125 via cp1004.eqiad.wmnet (squid/2.7.STABLE9) [23:23:11] all one squid [23:25:39] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:27:53] !log upgraded ganglia-monitor and gmetad from 3.1.2-2.1 to 3.3.5-2 [23:27:56] Logged the message, Master [23:48:48] I still get the error quite regularly (doing QA testing on PageTriage). Should I report this problem as an RT ticket or is someone investigating? [23:49:39] * Reedy attempts to replicate [23:50:38] http://en.wikipedia.org/wiki/Special:NewPagesFeed is a good page to test on since the interface won't load if the API request fails [23:51:38] It just says "Please wait" forever :) [23:55:19] binasher: we just wrapped up a MF deployment and need a varnish cache flush - can you hook it up? [23:55:32] sure [23:55:38] thanks dude [23:56:04] should be flushed [23:56:47] cool thanks
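(The errors reported above come back from the squid layer, note "via cp1004.eqiad.wmnet" in every message, rather than from MediaWiki itself, which is why unrelated tools all hit them at once. A small probe of the kind used while QA-testing Special:NewPagesFeed, assuming the third-party requests library; the query is just a lightweight siteinfo call and the failure test is an approximation of the error text quoted above.)

```python
import time
import requests  # third-party

API = "http://en.wikipedia.org/w/api.php"
PARAMS = {"action": "query", "meta": "siteinfo", "format": "json"}

failures = 0
for _ in range(50):
    r = requests.post(API, data=PARAMS, timeout=10)
    # Squid-level failures arrive as HTML error pages naming the proxy
    # (e.g. "via cp1004.eqiad.wmnet"), not as MediaWiki API error objects.
    if r.status_code >= 500 or "ERR_SOCKET_FAILURE" in r.text:
        failures += 1
    time.sleep(1)
print("failed %d/50 API requests" % failures)
```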