[00:01:18] !log just completed an online schema change for commonswiki.recentchanges in prod. woo! [00:01:21] Logged the message, Master [00:04:54] binasher: ah. sweet. [00:05:01] that's with the new support? [00:06:14] Ryan_Lane: with a slightly hacked up version of pt-online-schema-changed (fenari:/home/asher/db/pt-online-schema-change-2.1.1-no_child_table_patch) [00:06:24] i need to send them my patches [00:07:09] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [00:09:25] New patchset: Asher; "remove old s2 dbs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7695 [00:09:44] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7695 [00:09:59] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7695 [00:10:01] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7695 [00:10:58] New patchset: Asher; "fix path to tcpdump" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7696 [00:11:18] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7696 [00:11:18] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7696 [00:11:20] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7696 [00:14:21] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [00:17:01] kk [00:18:24] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [00:34:04] New patchset: Hashar; "basic header for CommonSettings.php" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7697 [00:43:28] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7180 [00:43:30] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7180 [00:44:06] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7516 [00:44:08] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7516 [00:44:42] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7576 [00:44:44] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7576 [00:45:27] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7697 [00:45:29] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7697 [00:46:02] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7578 [00:46:04] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7578 [00:46:32] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7182 [00:46:34] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7182 [00:48:16] !log rebooting db51 for kernel upgrade, prior to promoting to s4 master 
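(The schema change logged above was run with a locally patched pt-online-schema-change 2.1.1; the exact command line is not in the log. For orientation, this is roughly the shape of such an invocation with Percona Toolkit 2.1-era options, wrapped in Python only so it matches the other sketches in this log. The host, ALTER clause and chunk size are placeholders, not the values used for commonswiki.recentchanges. How the tool actually performs the change, shadow table plus triggers plus chunked copy, is described by binasher further down in this log.)

```python
import subprocess

# Hypothetical invocation shape for pt-online-schema-change (Percona Toolkit 2.1).
# Host, ALTER clause and chunk size below are placeholders.
cmd = [
    "pt-online-schema-change",
    "--alter", "ADD INDEX rc_example (rc_timestamp)",   # placeholder ALTER
    "--chunk-size", "1000",                             # copy rows in small batches
    "--dry-run",                                        # switch to --execute once the dry run passes
    "h=db-master.example,D=commonswiki,t=recentchanges",
]
subprocess.run(cmd, check=True)
```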
[00:48:18] Logged the message, Master [00:50:30] PROBLEM - Host db51 is DOWN: PING CRITICAL - Packet loss = 100% [00:51:33] RECOVERY - Host db51 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [00:54:01] New patchset: Hashar; "make 10.0.5.8:8420 a global variable" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7702 [00:54:58] !log preparing to rotate s4 master from db31 to db51 [00:55:01] Logged the message, Master [00:55:44] hashar: global $10.0.5.8:8420; looks weird [00:56:18] let me rephrase it [00:56:41] converts hardcoded 10.0.5.8:8420 to new $wmgUdp2logDest [00:56:47] Reedy: would it better this way? [00:57:02] heh [00:57:12] (wants a citrus can?) [00:57:22] I'll be mightily impressed [00:57:37] I can definitely find someone to bring one to you :-]]] [00:58:11] New patchset: Hashar; "convert hardcoded 10.0.5.8:8420 to $wmgUdp2logDest" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7702 [00:58:15] better tittle [00:58:34] !log new s4 master position - MASTER_LOG_FILE='db51-bin.000114', MASTER_LOG_POS=1772578 [00:58:37] Logged the message, Master [00:59:58] PROBLEM - MySQL Replication Heartbeat on db1004 is CRITICAL: CRIT replication delay 215 seconds [01:00:01] !log shutting down mysql on db31, then rebooting [01:00:04] Logged the message, Master [01:00:15] PROBLEM - MySQL Replication Heartbeat on db22 is CRITICAL: CRIT replication delay 230 seconds [01:00:42] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 259 seconds [01:01:00] PROBLEM - MySQL Replication Heartbeat on db1038 is CRITICAL: CRIT replication delay 277 seconds [01:01:09] PROBLEM - MySQL Replication Heartbeat on db51 is CRITICAL: CRIT replication delay 284 seconds [01:01:29] ^^ will clear when s4-master dns change propagates, not an issue [01:02:19] New patchset: Asher; "s4 master -> db51" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7703 [01:02:38] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7703 [01:03:39] New review: Asher; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7703 [01:03:41] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7703 [01:06:42] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay seconds [01:06:42] RECOVERY - MySQL Replication Heartbeat on db1038 is OK: OK replication delay seconds [01:07:09] RECOVERY - MySQL Replication Heartbeat on db1004 is OK: OK replication delay seconds [01:07:18] RECOVERY - MySQL Replication Heartbeat on db22 is OK: OK replication delay seconds [01:15:54] RECOVERY - MySQL Replication Heartbeat on db51 is OK: OK replication delay 0 seconds [01:41:15] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 225 seconds [01:44:06] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds [01:46:19] New patchset: Hashar; "implements beta labs specific domains" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7705 [01:46:20] New patchset: Hashar; "override $cluster when on labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7706 [02:15:18] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [02:18:30] * jeremyb scrolls up [02:45:50] Reedy: ping [02:45:55] Ohai [02:46:04] hey, thanks for the merges ;) [02:46:04] Yes, I ran namespaceDupes for 3 wikis [02:46:18] how do you know what i'm going to ask? ;) [02:46:32] i thought only one needed it but can't hurt ;) [02:46:59] Well, 3 had fixable changes [02:47:00] i didn't see it !log'd so i was wondering. (but i did see stuff in the new namespace) [02:47:03] so I guess it was good I should ;) [02:47:29] huh [02:47:40] i guess something can always slip in [02:47:52] but one had widespread + active use i think [02:48:04] they just didn't realize it wasn't really an NS [02:48:35] whatchya think about 7574? [02:48:37] People do some weird stuff [02:50:22] I've not tested it... [02:50:31] But for consensus I could just blame DannyB [02:54:22] Links work at least [02:55:12] bugzilla just gave me i'm just wrapping my brion around all of this [02:55:23] Indeed [02:55:30] We have some amusing quotes [02:55:41] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7574 [03:00:53] Seriously, how did that path conflict [03:03:10] New patchset: Reedy; "Bug 36813 - update wgUploadNavigationUrl on all cs wikis" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7574 [03:04:08] New review: Reedy; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7574 [03:04:10] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7574 [03:13:36] idk [03:13:49] but i was trying to fix it... [03:13:51] ! [remote rejected] HEAD -> refs/for/master (change 7574 closed) [03:13:57] * jeremyb is too slow! ;-P [03:13:59] Whitespace apparently [03:14:06] yeah, but still idk [03:14:14] there are no lines changed in both i think [03:14:36] the only difference i could see was the space between 'cswiki' and => [03:15:26] i wonder if bug 33919 needs doing for 2013? 
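(The s4 rotation above promotes db51 and publishes the new coordinates, MASTER_LOG_FILE='db51-bin.000114' and MASTER_LOG_POS=1772578; each replica then has to be repointed at them. A minimal sketch of one such repoint, assuming pymysql and placeholder replica host and credentials; the real rotation is driven by the s4-master DNS change and the usual tooling, not this script.)

```python
import pymysql

# Placeholder replica host and credentials; the replication user/password
# already configured on the replica are preserved by CHANGE MASTER.
conn = pymysql.connect(host="db22.example", user="repl_admin",
                       password="secret", autocommit=True)
with conn.cursor() as cur:
    cur.execute("STOP SLAVE")
    cur.execute(
        "CHANGE MASTER TO MASTER_HOST=%s, "
        "MASTER_LOG_FILE=%s, MASTER_LOG_POS=%s",
        ("db51", "db51-bin.000114", 1772578),
    )
    cur.execute("START SLAVE")
```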
[03:15:40] (make a search index for it) [03:16:13] No [03:16:20] notpeter gave me a list of all current indexes [03:16:30] i gave him a list of all wikis not on that list (non private) [03:16:43] and he created the indexes [03:19:08] how long was it? [03:19:20] 7 or 8 wikis [03:19:44] Reedy: would you be able to review my changes in operations/mediawiki-config ? [03:19:53] although it is probably late right now [03:20:04] and you are busy ;-) [03:24:58] New patchset: Hashar; "implements beta labs specific domains" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7705 [03:24:58] New patchset: Hashar; "override $cluster when on labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7706 [03:24:59] New patchset: Hashar; "convert hardcoded 10.0.5.8:8420 to $wmgUdp2logDest" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7702 [03:25:11] rebased them to be sure [03:28:36] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/6905 [04:19:17] what is news.dblist ? and "p is a symlink to "php" [04:19:20] gah [04:19:31] what is news.dblist ? and "p" is a symlink to "php" why? for convenience? [04:31:15] date has passed: https://gerrit.wikimedia.org/r/gitweb?p=operations/mediawiki-config.git;a=blob;f=wmf-config/CommonSettings.php;hb=e36b459faa1cbee44d205b0cf2439e6a7fb0b0aa#l1827 [04:31:20] should be removed? [04:46:45] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [05:00:51] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [05:07:19] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [05:31:37] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [06:20:30] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [06:29:03] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:17:58] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:37:55] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [07:42:11] anyone understand how codurr is configured? [07:42:50] i see bits of it in the puppet repo but it's commented out in favor of CIA [07:43:01] and i thought CIA wasn't in use... 
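(The search-index check near the start of this stretch, a "list of all current indexes" diffed against "all wikis not on that list, non private", is a plain set difference over dblists. A sketch under the assumption that the inputs are one-wiki-per-line text files; the actual file names exchanged by notpeter and jeremyb are not in the log, so the ones below are hypothetical.)

```python
# Hypothetical file names; the real lists exchanged are not named in the log.
def read_list(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

all_wikis = read_list("all.dblist")
private = read_list("private.dblist")
indexed = read_list("current-search-indexes.txt")

missing = sorted((all_wikis - private) - indexed)
print("\n".join(missing))  # wikis that still need a search index
```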
[08:17:10] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:32:37] PROBLEM - Apache HTTP on srv297 is CRITICAL: Connection refused [08:35:37] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:37:31] PROBLEM - Apache HTTP on srv296 is CRITICAL: Connection refused [08:39:28] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [08:42:55] RECOVERY - Apache HTTP on srv297 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.036 second response time [08:46:40] PROBLEM - Host srv296 is DOWN: PING CRITICAL - Packet loss = 100% [08:48:28] RECOVERY - Host srv296 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [08:54:46] RECOVERY - Apache HTTP on srv296 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.141 second response time [09:03:19] PROBLEM - Host srv295 is DOWN: PING CRITICAL - Packet loss = 100% [09:04:58] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:05:43] RECOVERY - Host srv295 is UP: PING OK - Packet loss = 0%, RTA = 0.77 ms [09:08:43] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100% [09:09:28] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [09:10:41] eh, srv278, just rebooted but i didnt touch it and did not get upgrades.? any tests? [09:13:40] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [09:14:08] mutante: I vaguealy recollect it rebooting a lot lately [09:15:19] ah [09:15:20] mutante: https://rt.wikimedia.org/Ticket/Display.html?id=24 [09:15:28] I reopened that 2 weeks ago [09:15:40] aha:) thanks [09:15:50] what should we do? [09:16:00] ping rob? [09:16:26] let rob decide if it should reported to Dell or decommissioned and then forward to Chris [09:16:56] okay [09:17:12] New patchset: Hashar; "convert hardcoded 10.0.5.8:8420 to $wmfUdp2logDest" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7702 [09:18:12] New review: Hashar; "(no comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7702 [09:18:49] New review: Hashar; "wmf prefix make more sense. Also we have to fill the value BEFORE including InitialiseSettings.php" [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/7702 [09:20:09] New patchset: Hashar; "implements beta labs specific domains" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7705 [09:20:43] New review: Hashar; "Patchset 3 is a rebase on latest merged change." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/7705 [09:21:28] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [09:21:46] paravoid: [ 32.148486] ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMC (20090903/power_meter-772) [09:22:03] ACPI Error: SMBus or IPMI write requires Buffer of length 42, found length 20 ..shrug [09:22:35] New patchset: Hashar; "override $cluster when on labs" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7706 [09:22:53] New review: Hashar; "Patchset 3 is a rebase on latest merged change." 
[operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/7706 [09:24:33] !log srv278 - still has issues as in reopnened RT #24 - upgrading kernel anyways [09:24:38] Logged the message, Master [09:24:55] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.030 second response time [09:26:43] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:31:04] PROBLEM - Host srv294 is DOWN: PING CRITICAL - Packet loss = 100% [09:33:27] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:33:45] RECOVERY - Host srv294 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [09:34:03] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused [09:36:45] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.038 second response time [09:37:12] PROBLEM - Apache HTTP on srv294 is CRITICAL: Connection refused [09:37:12] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:40:03] RECOVERY - Apache HTTP on srv294 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time [09:42:54] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:46:21] PROBLEM - Apache HTTP on srv293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:48:18] PROBLEM - Host srv293 is DOWN: PING CRITICAL - Packet loss = 100% [09:48:36] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [09:49:57] RECOVERY - Host srv293 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms [09:50:42] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:53:24] RECOVERY - Apache HTTP on srv293 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time [10:06:09] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:17:35] New review: Dzahn; "testing if i have +2 now / this is not deployed anywhere yet." 
[operations/apache-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7024 [10:17:36] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/7024 [10:23:06] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:25:05] New patchset: Hashar; "adding .gitreview" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/7717 [10:26:29] !change 7025 | hashar [10:26:29] hashar: https://gerrit.wikimedia.org/r/7025 [10:27:08] mutante: yeah going to rebase 7025 [10:27:13] cool [10:27:16] New review: Hashar; "(no comment)" [operations/apache-config] (master) C: 1; - https://gerrit.wikimedia.org/r/7717 [10:28:12] New review: Dzahn; "(no comment)" [operations/apache-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7717 [10:28:14] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/7717 [10:29:15] New patchset: Hashar; "ignore some well known scheme and other specific files" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/7025 [10:30:04] New review: Dzahn; "(no comment)" [operations/apache-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7025 [10:30:06] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/7025 [10:30:19] easy stuff [10:30:46] next step would be to migrate the files in fenari to use that repo [10:31:21] yes, replace svn with git clone from this one on fenari as intermediate step before the day it is all puppet deployed was my understanding so far [10:31:49] I don't think it will be puppetized [10:31:51] but that moment needs to be announced and a big warning added to not keep using the old way [10:32:15] by my understanding, we want all to apply the apache conf changes all at the sametime [10:33:45] food time [10:34:09] ok, so what we need next is a script/alias that pulls them from this and then syncs [10:34:14] yep, food good idea [10:34:47] then I will do some accounting ... [10:34:50] I hate accounting [10:35:30] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [10:35:31] needs do book tickets etc for travel [10:37:37] Laura can definitely help you there :-D [10:37:52] !log on db39 dropped triggers pt_osc_elwiki_recentchanges ins, del, upd, they were preventing all elwiki edits except bot edits with the complaint Table 'elwiki._recentchanges_new' doesn't exist ... binasher, doublecheck me please? [10:37:56] Logged the message, Master [10:37:58] though inside Germany, you will probably fine :-] [10:40:41] I'm pretty sure something weird happened hen when the commonswiki recentchanges schema change went around but he's the only one who will know [10:41:57] PROBLEM - Apache HTTP on srv292 is CRITICAL: Connection refused [10:42:42] the other dbs on s3 did not have triggers (I did a quick check for *TRN, *TRG [10:43:18] RECOVERY - Apache HTTP on srv292 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.167 second response time [10:56:03] PROBLEM - Host srv291 is DOWN: PING CRITICAL - Packet loss = 100% [10:58:00] RECOVERY - Host srv291 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [11:22:46] does anyone around here know about our backups strategy? [11:22:55] around here and around this timezone :) [11:23:35] backups of what in particular? 
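(The db39 entry above removes pt_osc_* triggers left over from an online schema change; they were blocking elwiki edits because their target table no longer existed. A hedged sketch of finding and dropping such leftovers, assuming pymysql and placeholder host and credentials; the trigger naming follows the pt_osc_<db>_<table>_<ins|upd|del> pattern quoted in the log, but the exact commands run on db39 are not recorded.)

```python
import pymysql

# Placeholder host and credentials.
conn = pymysql.connect(host="db39.example", user="root", password="secret",
                       autocommit=True)
with conn.cursor() as cur:
    # pt-online-schema-change names its triggers pt_osc_<db>_<table>_<ins|upd|del>.
    cur.execute(
        "SELECT trigger_schema, trigger_name "
        "FROM information_schema.triggers WHERE trigger_name LIKE %s",
        ("pt_osc_%",),
    )
    for schema, name in cur.fetchall():
        print("dropping trigger %s.%s" % (schema, name))
        cur.execute("DROP TRIGGER IF EXISTS `%s`.`%s`" % (schema, name))
```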
[11:23:48] I've sent a mail a few days ago [11:23:52] about nfs1/2 [11:23:57] ah [11:24:05] that I couldn't find people's ~ to be backed up [11:24:08] I thought /home went on tridge, no? [11:24:10] lemme look [11:24:17] I have no idea, hence asking :) [11:24:23] I'll check [11:25:26] thanks but please give me some info on what you're checking so I won't ask next time :) [11:26:16] I will [11:26:34] :-) [11:35:38] /data/db20/home [11:35:41] now, how I found it. [11:35:58] grep for tridge in puppet repo. [11:36:02] see this suspicious line: [11:36:12] command => '[ -d /home/wikipedia ] && rsync --rsh="ssh -c blowfish-cbc -i /root/.ssh/home-rsync" -azu /home/ [11:36:12] * db20@tridge.wikimedia.org:~/home/' [11:36:22] walk back to see what class it is, where it's included, blah blah [11:36:35] finally realize I should just look on tridge and not on the nfs hosts :-P [11:37:10] fwiw it is listed in the backup status page [11:37:15] http://wikitech.wikimedia.org/view/Backup_status_chart [11:37:32] not that it says where they go exactly but that's not what that page is for really [11:39:43] paravoid: [11:58:01] hello [11:59:16] apergos: oh, thanks [11:59:21] sure [11:59:30] but why are we using rsync for that instead of amanda? [11:59:41] I don't know details [11:59:45] when I checked, I checked /data/amanda [11:59:59] okay then [12:00:14] there is a bunch of stuff that gets rsynced. don't know who set them up nor what the rationale is [12:14:21] PROBLEM - Apache HTTP on srv290 is CRITICAL: Connection refused [12:16:36] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [12:20:50] !log deployment-prep replaced most occurrences of /mnt/upload to /mnt/upload6 [12:20:54] Logged the message, Master [12:22:40] hashar: wrong channel [12:22:48] arf [12:22:53] :) [12:22:54] haha [12:23:33] I guess I should get some sleep [12:23:39] waking up at 1am is no fun :-D [12:23:57] 1am I usually go sleep [12:24:17] I woke up at 1am :-( Took an half hour nap at 6am [12:24:21] that give us ability to be only 24/7 [12:24:22] jet lag sucks [12:24:27] * online [12:24:29] not only [12:24:33] PROBLEM - Host srv290 is DOWN: PING CRITICAL - Packet loss = 100% [12:24:35] I should probably sleep as well [12:25:16] see you later everyone [12:26:30] RECOVERY - Host srv290 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [12:26:39] PROBLEM - NTP on srv290 is CRITICAL: NTP CRITICAL: Offset unknown [12:29:30] RECOVERY - NTP on srv290 is OK: NTP OK: Offset 0.0323921442 secs [12:31:23] !gerrit-search | paravoid [12:31:31] arr,, no.. what was it [12:31:44] @search gerrit [12:31:44] Results (found 2): gerrit, change, [12:34:59] paravoid: anyways, just meant to show the example to search for text in commit message cause they use "subject" as column name, but subject doesnt work, message does: status:open project:operations/puppet message:fix [12:36:47] :) [12:36:55] !gerrit-search is https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+message:$1 [12:36:55] Key was added! 
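(The rsync exec quoted above is how /home ends up under /data/db20/home on tridge. As a standalone illustration of the same guard-then-push pattern, using the paths and key file from the quoted puppet command; this is a sketch of the command's structure, not the puppet resource itself.)

```python
import os
import subprocess

# Same structure as the puppet exec quoted above: only the host that actually
# carries /home/wikipedia pushes an incremental copy of /home to tridge.
if os.path.isdir("/home/wikipedia"):
    subprocess.run(
        ["rsync",
         "--rsh=ssh -c blowfish-cbc -i /root/.ssh/home-rsync",
         "-azu", "/home/", "db20@tridge.wikimedia.org:~/home/"],
        check=True,
    )
```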
[12:37:12] (well that is just for open changes, but if i drop that part it reallly slows down ) [12:39:33] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:42:24] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [12:42:59] !gerrit-search del [12:42:59] Successfully removed gerrit-search [12:43:08] !gerrit-search is https://gerrit.wikimedia.org/r/#/q/status:open+project:operations/puppet+message:$1,n,z [12:43:09] Key was added! [12:47:21] RECOVERY - Apache HTTP on srv290 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time [12:58:45] elections. so there it is [13:02:44] new elections ? [13:02:58] couldn't form a govt? [13:05:32] nope [13:08:21] PROBLEM - Apache HTTP on srv255 is CRITICAL: Connection refused [13:21:40] New patchset: Demon; "(bug 36827) Make "bug" case-insentitive for linking" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7724 [13:21:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7724 [13:53:45] PROBLEM - Apache HTTP on srv288 is CRITICAL: Connection refused [13:56:18] PROBLEM - Host srv288 is DOWN: PING CRITICAL - Packet loss = 100% [13:57:48] RECOVERY - Host srv288 is UP: PING OK - Packet loss = 0%, RTA = 0.20 ms [14:02:36] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:19:51] RECOVERY - Apache HTTP on srv288 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.028 second response time [14:22:15] PROBLEM - Host search22 is DOWN: PING CRITICAL - Packet loss = 100% [14:33:14] RECOVERY - Host search22 is UP: PING OK - Packet loss = 0%, RTA = 0.42 ms [14:34:35] PROBLEM - Apache HTTP on srv280 is CRITICAL: Connection refused [14:35:06] jeff_green: sudo ...only works on ocg2 [14:35:19] orly [14:36:32] PROBLEM - SSH on search22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [14:39:23] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:40:44] RECOVERY - SSH on search22 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0) [14:44:47] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:45:42] fenari strikes again . . . 
[14:47:38] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [14:54:05] RECOVERY - Lucene disk space on search22 is OK: DISK OK [14:58:53] RECOVERY - Apache HTTP on srv280 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.024 second response time [15:03:23] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:12:15] RECOVERY - NTP on search22 is OK: NTP OK: Offset -0.007666945457 secs [15:17:38] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:18:44] New patchset: ArielGlenn; "multiple workers; split out job queue and msg display code" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/7726 [15:33:41] New patchset: Demon; "Allow sudoers on gerrit box(es) to stop|start|restart gerrit" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7727 [15:34:00] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7727 [15:34:43] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [15:41:46] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [16:25:52] PROBLEM - Host srv280 is DOWN: PING CRITICAL - Packet loss = 100% [16:27:04] PROBLEM - check_all_memcacheds on spence is CRITICAL: MEMCACHED CRITICAL - Could not connect: 10.0.8.30:11000 (Connection refused) [16:28:13] don't worry, its just srv280 and up again already [16:28:34] RECOVERY - Host srv280 is UP: PING OK - Packet loss = 0%, RTA = 0.23 ms [16:28:43] RECOVERY - check_all_memcacheds on spence is OK: MEMCACHED OK - All memcacheds are online [16:28:43] PROBLEM - NTP on srv280 is CRITICAL: NTP CRITICAL: Offset unknown [16:31:34] RECOVERY - NTP on srv280 is OK: NTP OK: Offset 0.02140009403 secs [16:32:19] PROBLEM - Apache HTTP on srv280 is CRITICAL: Connection refused [16:34:55] !log updating dns for wiki-pedia.org [16:34:59] Logged the message, Master [16:35:19] RECOVERY - Apache HTTP on srv280 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.351 second response time [16:36:55] !log srv app servers max. uptime with older kernel down to ~120 days after another bunch of upgrades [16:36:58] Logged the message, Master [16:37:37] RobHalsell, .wiki-pedia.org should be changed too [16:38:07] Platonides: uh, you mean the apache stuff? cuz i just did the dns, still doing the rest. [16:38:22] or the name registrar nameservers, cuz gthats also in processing. [16:38:54] New patchset: Demon; "Re-attempting links for RT and CodeReview." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6005 [16:39:07] RobHalsell, I mean the IP pointed by the doman, not www. [16:39:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6005 [16:39:18] ie. http://wiki-pedia.org/ which points to 88.191.250.14 [16:40:10] a pity the tld can't be made a CNAME [16:40:19] it points to the same thing as wikipedia.org [16:40:29] if it shows something else for you, its just cached. [16:41:18] 15 07:42:11 < jeremyb> anyone understand how codurr is configured? 
[16:41:18] 15 07:42:50 < jeremyb> i see bits of it in the puppet repo but it's commented out in favor of CIA [16:41:21] 15 07:43:00 < jeremyb> and i thought CIA wasn't in use... [16:41:39] mutante: I just made some changes to apache files for the cluster, care to give them a second glance for me? [16:41:47] i saw you say somethign in channel last ;] [16:41:49] * Platonides looks how to make dig bypass caches [16:41:59] Platonides: +trace [16:42:01] Platonides: dig @ns0.wikimedia.org wiki-pedia.org [16:42:04] dig @nameserver lookup.fqdn [16:42:06] or that. [16:42:07] or maplebed's [16:42:28] hooray for an echo in IRC's man service. [16:42:29] i use the @ command since i like to query against all three of the WMF nameservers [16:42:47] maplebed: wanna look at an apache config file change? [16:42:51] sure! [16:42:53] where is it? [16:43:07] svn diff in /home/wikipedia/conf/httpd/ [16:43:11] yep, it ends at wikimedia.org. now [16:43:45] i added a rewrite for the wiki-pedia.org to wikipedia.org in redirects.conf [16:44:00] i am pretty sure its all good, but this file is what caused my major cluster outage years ago [16:44:01] RobHalsell: you're gonna hate me. [16:44:04] dig @ns0.wikimedia.org wiki-pedia.org isn't what I wanted, since the .org servers could not be pointing to wikimedia.org [16:44:06] RobHalsell: ok, where, but you are not using git yet, right?:) [16:44:07] maplebed: tab indentation is off [16:44:11] :D [16:44:13] ok, gotcha [16:44:20] maplebed: right? [16:44:26] i notice it in svn diff but didnt in vim [16:44:42] but its not actually off. [16:44:47] the rest of the file uses spaces, you put in a tab. [16:44:51] ahh [16:44:54] well god damn it [16:44:55] lemme fix [16:45:20] Platonides: I'm confused by your comment. [16:45:24] maplebed: ok, check again ;] [16:45:36] mutante: maplebed is checkin, no worries [16:45:46] maplebed, you said Platonides: dig @ns0.wikimedia.org wiki-pedia.org [16:45:46] 18:41:45 dig @nameserver lookup.fqdn [16:46:22] but dig @ns0.wikimedia.org wiki-pedia.org doesn't help if .org root servers pointed to a different dns server [16:46:32] sure it does. [16:46:46] you're specifically asking what ns0 thinks wiki-pedia.org points to. [16:46:55] if it doesnt know, it goes to its master [16:47:04] that's what will be right eventually (once the NS servers are pointing to the right place) [16:47:09] it just happens to be the authoritative nameserver in this case [16:47:21] RobHalsell: or it refuses to recurse [16:47:22] so long as ns0 thinks it's authoritative for the domain, it'll be what's right eventually. [16:47:39] well, as long as that, and the whois shows it as that ;] [16:47:39] I see where you're going [16:47:52] which until 15 minutes ago, it didnt [16:48:04] i still say +trace ;) [16:48:05] maplebed: so my changes look ok? [16:48:18] jeremyb: different method for similar data [16:48:19] RobHalsell: in the rewrite rule, I see the previou stanza terminates the host match (with a $) but yours and the next one don't. Do you know why the match should or should not terminate the host string? [16:48:31] RobHalsell: ok, just fyi. 
we have a new git repo for apache config now, it is not in use yet, but we did merge files, soon the plan is to change the svn command with cloning from this repo, or at least that is what we are expecting [16:48:46] mutante: oh =/ [16:48:54] RobHalsell: so at some point we may want those changes there now as well [16:49:10] maplebed, probably for a host like wiki-pedia.org.en.wiktionary.org [16:49:12] maplebed: hrmm, i think i should add the termination [16:49:13] jeremyb: +trace is better for showing what's authoritative right now. @ns0 is better if the root nameservers haven't gotten the NS record update yet but we want to see what it'll be once that happens. :D [16:49:39] but it's true, both are very useful. [16:49:45] maplebed: but if it has propagated already then +trace is best [16:50:21] maplebed: yea, i added the wikipaedia stanza as well [16:50:28] so i neglected the $ on it as well [16:50:29] =P [16:50:31] ah. [16:50:35] that makes sense. [16:50:41] hence why the two outliers were mine [16:50:46] look decent now? [16:51:43] RobHalsell: don't worry, right before anything is changed on fenari we will have to merge from svn again anyways, but if you feel like it, check in there as well after you are satisfied with the result being live [16:52:42] I hadn't seen the %1/$1 thing before. that's useful. [16:53:14] RobHalsell: +1 [16:53:20] cool [16:54:51] !log updated apache config for wiki-pedia.org, seems the bot doesnt spam that anymore =[ [16:54:54] Logged the message, Master [16:55:11] I would expect it to spam when you sync it out. [16:55:20] yet it didnt [16:55:30] usually i expect it when i gracefull all apaches [16:55:42] used to say 'so and such gracefulled all apaches' or whatever [16:55:44] hm.. IIRC there are a few different ways to sync, each of which is a wrapper to the next, one of which spams. [16:59:05] time for a taco truck lunch. [16:59:46] huh. [16:59:56] we own wikispecies.{com,net} but not org. [17:00:21] yea its one of the top levels we dont really have [17:00:24] yeah, that bot is broken since a while [17:00:25] there arent a lot of them [17:00:45] ok, back in a bit, going to walk down the block to the taco truck. YAY TACOS! \o/ [17:00:47] I'm surprised. [17:00:52] yay tacos! [17:00:59] yay for wikispecies [17:01:06] \o/ TACOS \o/ TACOS \o/ TACOS \o/ TACOS \o/ TACOS \o/ TACOS \o/ TACOS \O/ [17:01:10] RobHalsell: not tacocopter? [17:01:25] not in DC, or i would totally do it. [17:01:59] oh, yeah. DC [17:02:16] plus http://www.wired.com/gadgetlab/2012/03/qa-with-tacocopter/ [17:02:47] https://twitter.com/TCampDC/statuses/196630521973977089 [17:19:45] New review: Dzahn; "(no comment)" [operations/debs/wikistats] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7573 [17:19:47] Change merged: Dzahn; [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/7573 [17:32:54] New patchset: Aaron Schulz; "Use new thumb purge hook for testwikis." 
[operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7731 [17:33:31] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7731 [17:33:38] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7731 [17:33:40] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7731 [17:54:02] New patchset: Pyoungmeister; "adding ganglia collector to carbon in site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7735 [17:54:22] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7735 [17:54:37] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7735 [17:54:39] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7735 [18:03:07] PROBLEM - Packetloss_Average on oxygen is CRITICAL: XML parse error [18:15:30] hello [18:15:36] nosy's looking into https://jira.toolserver.org/browse/TS-1381 which is "The user specified as a definer ('root'@'208.80.152.165') does not exist" on replication (this is commons, s4) [18:15:53] i'm thinking we should dump grants on both master and TS and compare [18:16:13] someone want to do that for her? [18:16:27] New patchset: Aaron Schulz; "Added swift-backend log." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7737 [18:16:35] jeremyb: possibly a good idea [18:16:45] the ip that tried to send is fenary [18:16:57] thats what the reverse dns lookup says [18:17:01] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7737 [18:17:04] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [18:17:07] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7737 [18:17:09] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7737 [18:18:34] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [18:23:29] maybe mutante, notpeter ? [18:30:38] notpeter: ping? [18:32:17] sup [18:33:24] * notpeter read scrollback [18:34:01] hhhmmmm, I'd ask binasher, as he's effectively our lead dba at this point. I don't feel comfortable making that call [18:35:12] notpeter: ok, thanks, ill ask him then [18:35:29] notpeter: you can't just dump the grants and make a diff? (i.e. not change the DB at all) [18:40:34] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [18:41:41] jeremyb: it's only GRANT ALL PRIVILEGES ON `%wik%`.* TO 'wikiadmin'@'208.80.152.%' [18:42:17] Reedy: surely that's not all. you just can't see them all because you're not sufficiently privileged? [18:42:39] oooh, it's a binasher [18:42:41] binasher: have a sec? [18:43:06] what's up? 
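(The "dump grants on both master and TS and compare" idea above is cheap to script. A minimal sketch, assuming pymysql, placeholder hostnames and an account that can read mysql.user on both sides; for TS-1381 the interesting difference is whether 'root'@'208.80.152.165' (fenari) exists as an account at all, since, as binasher explains just below, triggers replicate with the DEFINER that created them.)

```python
import pymysql

def accounts(host):
    """Set of (user, host) accounts defined on a server (placeholder credentials)."""
    conn = pymysql.connect(host=host, user="admin", password="secret")
    with conn.cursor() as cur:
        cur.execute("SELECT user, host FROM mysql.user ORDER BY user, host")
        return set(cur.fetchall())

master = accounts("s4-master.example")
replica = accounts("toolserver-s4.example")
print("only on master: ", sorted(master - replica))
print("only on replica:", sorted(replica - master))
```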
[18:43:11] hmm, there's a root/repl in the process list [18:43:20] TS replication for s4 [18:43:25] 15 18:15:36 < jeremyb> nosy's looking into https://jira.toolserver.org/browse/TS-1381 which is "The user specified as a definer ('root'@'208.80.152.165') does not exist" on replication (this is commons, s4) [18:43:29] yeah, i emailed nosy a couple hours ago [18:43:29] 15 18:15:52 < jeremyb> i'm thinking we should dump grants on both master and TS and compare [18:43:51] huh, she was just here a few mins ago and didn't say anything about that. (but she did say she mailed you) [18:44:41] "this was a result of me performing an online schema change. More of these will happen on all wikis. Granting all privileges to root @ fenari in all dbs should be safe, let you resume s4, and prevent further issues. Since that ip can't hit your db, it shouldn't be a security issue. " … "Triggers replicate a bit differently than normal queries and need to run as the user that created them instead of as the normal replication user [18:44:46] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [18:45:08] New patchset: Pyoungmeister; "adding if statement for precise to install oracle java instead of sun java" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7739 [18:45:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7739 [18:45:28] ohhh, i knew triggers were special. i just couldn't imagine why they would be used [18:45:38] OSC makes sense [18:46:12] also from my email to nosy "The osc creates a new table with the new schema, copies from the old to the new in small chunks, and creates triggers on the old table so that changes are propagated to the new table during the copy. Then it switches them and removes the original when done. That way you can run alters on a master under certain conditions without locking." [18:46:33] Fancy [18:47:11] binasher: do you get all of that (triggers and copying) propagated to all slaves? i guess so or this wouldn't be an issue [18:47:51] binasher: there must be some limit on what kind of queries it works with though? must be idempotent? [18:48:13] UPDATE foo SET bar=bar+1; [18:48:52] New patchset: Pyoungmeister; "adding if statement for precise to install oracle java instead of sun java" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7739 [18:49:11] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7739 [18:50:16] jeremyb: that's query is fine, handled with an "after update" trigger that calls a replace into on the temp table with the new value of the changed row [18:50:51] ahh [18:51:36] there are limitations on the type of alter you can run though, and a requirement that the existing table have at least one unique key of some sort [18:52:35] and things get more complicated if you have real foreign key relationships [18:52:37] does it detect unsupported alters and stop you? 
[18:52:42] or you just have to know [18:57:48] it will at least stop you by failing without destroying any data :) but may leave you needing to clean things up via dropping the target table or triggers [18:58:55] but not fail fast i guess [19:00:04] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:02:24] it does make some checks for its requirements, and does a "create table like" and runs the actual alter on the target table first and fails quickly there, but there's definitely room to get into trouble after those steps [19:17:01] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is -0.0876528099174 [19:21:22] PROBLEM - Packetloss_Average on oxygen is CRITICAL: XML parse error [19:21:41] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [19:21:58] PROBLEM - Puppet freshness on searchidx2 is CRITICAL: Puppet has not run in the last 10 hours [19:30:33] hiya [19:30:43] there is udp2log packet loss on oxygen right now [19:30:45] trying to find out why [19:31:03] the squid logging multicast relay is using 75% cpu [19:31:38] trying to understand how that works, and what it does [19:31:43] what machine is 233.58.59.1? [19:32:18] notpeter, any idea? [19:33:21] ottomata: that's a multicast address [19:33:23] it's a multicast address, so it's not just a machine [19:33:35] the relay should be on oxygen [19:33:45] aye right right [19:33:52] yeah the relay is there [19:34:33] is there anyone reading from the relay right now? [19:34:50] can I stop it for a minute? [19:34:54] what is using the relay? [19:35:26] any host that is a member of the group with the IP you listed [19:35:34] ottomata: the udp2log instance on oxygen is using it.... [19:35:41] somewhere the group members must be configured afaik [19:36:10] ottomata: the relay is unicast to multicast [19:36:18] if you stop the relay we'll stop getting logs from esams [19:36:28] ok [19:36:31] won't stop [19:39:02] New patchset: Aaron Schulz; "Disabled new hook due to loop bug." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7742 [19:39:04] so only oxygen's udp2log instance is using the multicast relay? [19:39:17] yes [19:39:26] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7742 [19:39:33] New review: Aaron Schulz; "(no comment)" [operations/mediawiki-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7742 [19:39:35] is there a reason to use multicast then? [19:39:35] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7742 [19:39:36] so, if you stop it, all packet loss will stop [19:39:48] haha, right, because all packets will be lost? :p [19:40:08] the reason is to be able to scale grabbing the logs [19:40:12] s/lost/not sent/! [19:40:26] so [19:40:53] socat listens on oxygen on port 8419, and then multicasts to 233.58.59.1:8420 [19:40:59] which only oxygen is subscribing to [19:40:59] yes [19:41:02] yes [19:41:03] (right now) [19:41:06] yes [19:41:28] i could also just set up a 2nd udp2log instance on oxygen listening on port 8419 [19:41:33] and kill the socat? 
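(The relay under discussion is conceptually tiny: receive unicast UDP on oxygen port 8419, the stream from esams, and re-send every datagram to the multicast group 233.58.59.1:8420, so any number of consumers can join the group, which is notpeter's scaling argument. A minimal sketch of both halves, using the addresses from the conversation; in production the relay is socat and the consumer is udp2log, not this code.)

```python
import socket
import struct

GROUP, GROUP_PORT, UNICAST_PORT = "233.58.59.1", 8420, 8419

def relay():
    """Unicast-to-multicast relay: the role socat plays on oxygen."""
    rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # Ask for a large receive buffer; the kernel caps this at net.core.rmem_max,
    # which is why the sysctl change discussed below matters for packet loss.
    rx.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 524288)
    rx.bind(("0.0.0.0", UNICAST_PORT))
    tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    tx.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 2)
    while True:
        data, _ = rx.recvfrom(65535)
        tx.sendto(data, (GROUP, GROUP_PORT))

def subscriber():
    """Group member: what each udp2log instance does when it joins the group."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", GROUP_PORT))
    mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
    s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    while True:
        line, _ = s.recvfrom(65535)
        # hand the squid log line to whatever processes it
```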
[19:42:19] Poor cat [19:42:26] I mean, ideally, we'd be transitioning all of the udp2log instances to use the multicast relay [19:42:31] not go the other direction [19:43:11] Reedy: responsible for this (pointed out by jeremy) http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=613945 [19:43:12] notpeter: ha ha ha [19:43:17] cuz if, say, you ever want to spin up another udp2log instance, you're going to want that multicast relay [19:43:22] notpeter: I literally just said that to Ryan_Lane [19:44:51] right [19:45:00] but somethign is causing packet loss on oxygen now [19:45:05] maybe the relay should be elsewhere [19:45:16] not on the same machine that is also writing the logs [19:46:23] I mean, if you need more boxes, ask for them [19:46:41] but I think turning of the relay is not the direction to go [19:48:00] maybe not long term [19:48:04] but if there is packet loss right now [19:48:06] people get angry [19:49:40] PROBLEM - Puppet freshness on db29 is CRITICAL: Puppet has not run in the last 10 hours [20:06:17] notpeter: the puppet rule that puts 99-big-rmem.conf in place [20:06:27] needs to "sysctl -p /etc/sysctl.d/99-big-rmem.conf" after to make it go live without waiting for a reboot [20:06:31] * binasher manually did that [20:07:19] is it possible that we would see these alerts for packet loss from only one log machine? [20:07:35] most of the loss numbers are low, except for [20:07:35] sq86.wikimedia.org lost: (28.30189 +/- 16.44868)% [20:07:50] now i've restarted udp2log [20:07:50] a [20:07:51] n [20:07:55] and its running with --recv-queue=524288 [20:08:16] that margin of error is pretty huge [20:08:26] yeah true [20:09:05] socat has been dropping some packets though [20:09:53] it doesn't have an adjustable buffer size like udp2log, so would need to alter rmem_default and restart that [20:16:41] so, binasher, does/will that fix the packet loss problem? [20:16:55] and, are you saying that the sysctl setting needs to be permanently puppetized? [20:21:11] thought I ran that, but yeah, put it into puppet is the right call [20:34:14] New patchset: Ottomata; "Reloading sysctl when rmem_max file changes." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7751 [20:34:33] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7751 [20:34:55] notpeter, binasher, how's that? [20:34:58] helpful? [20:44:39] does anyone know how to regenerate the wikiversions.cdb file ? [20:45:10] AaronSchulz: ---^^ [20:45:24] that part is not complicated enough [20:45:39] I think it should need some more level of file abstractions and several caches :-D [20:45:51] populateWikiversionCDB.php [20:45:53] sounds good [20:48:56] the beta labs cluster is fully screwed hehe [21:01:10] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:01:43] hashar: sync-wikiversions regenerates it and pushes [21:01:46] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:01:54] Reedy: yeah should [21:01:59] RoanKattouw: Have you seen/read https://bugzilla.wikimedia.org/show_bug.cgi?id=36839 ? 
If not, ahve a look at comment 10 [21:02:25] Reedy: trying to do it on labs, it ends up /h/w/c does not have 1.20wmf2 / 1.20wmf3 [21:02:40] Reedy: we will need to make them submodules one day :-) [21:02:44] heh [21:02:56] I am trying to get foreachwiki to work :-D [21:03:07] Reedy: OOOOOOOH [21:03:14] Dude I was totally puzzled as to why this was happening [21:03:47] OK [21:04:01] IIRC you used 1000 as an overly conservative value? [21:04:19] I guess so [21:04:37] We could set it to 10k maybe [21:04:48] But still, this bug is going to happen [21:04:55] I wonder why this particular regex recurses so deeply [21:05:32] Oh [21:05:41] It's (onething|twothings|threethings)+ [21:08:49] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:11:35] RoanKattouw: maybe it doesn't *actually* recurse, comment 10 sounds like if you have a greedy regex you'll exhaust the "recursion" limit simply by matching beyond what the regex was looking for. not sure if it's true in PHP, but it's the difference between ".+" and ".+?" in Ruby. [21:12:43] wikiversions.cdb successfully built. [21:12:45] oh yeah! [21:12:47] (on labs) [21:12:51] have a good night [21:13:15] thanks again hashar [21:13:26] as I said [21:13:31] there is a ton of prerequisites :-D [21:13:42] at least we now have 1.20wmf branches and multiversions [21:13:49] need to do some DB upgrades [21:13:53] \O/ [21:13:57] of master [21:14:15] will be for tomorrow, for now, I am head bed :-] [21:14:25] will investigate the extension issue tomorrow [21:14:29] bye!! [21:16:16] chrismcmahon: Aha, using +? works [21:16:26] Well, sort of [21:17:02] It starts failing at 1023 chars instead [21:17:31] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:18:34] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:21:04] RoanKattouw: I don't know much CS, but I've been wrestling regexes a long time. testers tend to do that. [21:21:43] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [21:38:47] @replag [21:38:49] Krinkle: No replag currently. See also "replag all". [21:39:58] hi guyyyys, gonna be noisy for a minute about some gerrit changes I have that are waiting for review [21:40:09] notpeter: these two are for you: [21:40:10] https://gerrit.wikimedia.org/r/#/c/6798/ [21:40:14] https://gerrit.wikimedia.org/r/#/c/7751/ [21:40:28] @replag [21:40:29] Krinkle: [s1] db12: 12s; [s5] db44: 1s [21:43:37] and this one, hmmm [21:43:39] maybe maplebed? [21:43:39] https://gerrit.wikimedia.org/r/#/c/7285/ [21:44:00] hmmm... [21:44:18] :D [21:44:36] New patchset: Bhartshorne; "adding in configuration option to only write thumbs to some containers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7753 [21:44:38] we can trade! [21:44:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7753 [21:47:37] haa, ok! [21:49:31] oo, Q [21:49:33] is it better to do [21:49:35] scope.lookupvar("swift::proxy::config::write_thumbs") [21:49:36] than just [21:49:40] $swift::proxy::config::write_thumbs [21:49:44] ohhh, or is that not valid ruby? [21:49:46] with the :: [21:49:46] ? [21:49:52] I think maybe it's not valid. 
[21:49:55] aye [21:50:01] the scope.lookup thing is what ryan told me to do. [21:50:08] ok, just curious [21:50:20] i've used that before, but only when I was trying to do fancier things with the variable if it wasn't set [21:50:25] apparently the way puppet looks up scope is changing across some version around now [21:50:53] but I don't really understand the details. [21:51:28] how is the statistics::db different from generic::mysql? [21:51:53] oh. [21:51:56] I don't think you want the mysql::stuff. [21:52:06] I think that's specific to cluster dbs. [21:52:19] binasher: do you remember which dbs are allowed to use mysql:: stuff and which aren't? [21:52:19] New review: Ottomata; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/7753 [21:52:21] (puppet) [21:52:39] um, i think i looked at it, most of it was pretty generic [21:52:44] lemme double check [21:53:06] most, yes, but there's some weirdness about the percona stuff and cluster membership. [21:53:26] hmmmmm [21:53:34] well, i manually included the packages I wanted [21:53:40] include mysql::packages, [21:53:40] 125 » » mysql::mysqluser, [21:53:41] 126 » » mysql::mysqlpath, [21:53:41] 127 » » mysql::datadirs [21:53:56] packages just installs from debs [21:54:11] is there a problem with those packages for a stand alone mysql server? [21:54:28] maybe not... asher's done some work here since I last tried. [21:54:33] I'm reading more carefully now. [21:55:16] yeah, you're right; it looks like everything you've included is fine. [21:56:14] ok cool [21:56:24] do you want a +1 or +2+merge? [21:56:38] this is just for fabian, he wanted to be able to load some data into a local db so he could run some simple queries [21:56:40] +2 please :) [21:56:45] my mysql packages won't give you a my.cnf [21:56:48] I'd +2 you but I don't have the powers [21:56:49] that's fine [21:56:54] i'll tweak and then puppetize? [21:56:54] i wouldn't really recommend it for a one off [21:56:56] zat ok? [21:57:47] binasher: what would you suggest? [21:58:09] i think we had this conversation.. a generic mysql class that just installs the distro stuff [21:58:20] oh, rather than the wikimedia debs? [21:58:25] yes [21:58:36] i can do that [21:58:41] so there's generic::mysql::server [21:58:54] just saying, there are mariadb classes in labs that just install a standalone server, but still dunno about the mariadb vs. mysql, (i like not having to Oracle message on startup) [21:58:55] but it *just* pulls in the mysql server package, not the lvm stuff or the percona tools or maatkit or ... [21:58:59] * mutante hides again [21:59:13] oh! [21:59:17] how'd I miss those [21:59:34] probably because you though "Hm, I want mysql" and so went straight to mysql.pp. [21:59:37] geez ok, maplebed, don't approve for now, i will fix, and then poke again later [21:59:38] like a sane person. [21:59:41] haha [22:00:01] /kick maplebed [22:00:06] ;) [22:00:45] (I'm pretty sure I put in the generic::mysql::server class.) [22:01:19] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:04:17] New patchset: Bhartshorne; "adding in configuration option to only write thumbs to some containers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7753 [22:04:37] New review: gerrit2; "Lint check passed." 
[operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7753 [22:05:18] New review: Ottomata; "Going to use generic::mysql::* classes instead." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/7285 [22:09:08] New patchset: Bhartshorne; "adding in configuration option to only write thumbs to some containers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7753 [22:09:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7753 [22:12:52] New patchset: Bhartshorne; "adding in configuration option to only write thumbs to some containers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7753 [22:13:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7753 [22:15:29] New patchset: Bhartshorne; "adding in configuration option to only write thumbs to some containers" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7753 [22:15:49] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7753 [22:17:40] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours [22:19:46] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [22:21:18] New review: Aaron Schulz; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/7753 [22:30:25] !/ [22:30:41] !log snowolf is gay and autistic [22:30:44] Logged the message, Master [22:31:15] RobH: ^ [22:31:39] lmao how does it feel to be powerless snowolf [22:32:19] .... [22:32:28] bah, now i have to look up the ban command [22:32:31] this client sucks [22:32:37] Gotta love auto rejoin [22:32:37] RobH: op me [22:32:43] I'll ban :) [22:32:45] /mode #wikimedia-operations +b angryblackman [22:32:51] Damianz: no doesnt work like that [22:33:04] Snowolf: just put the command in channel without the / to start [22:33:08] Don't do it Rob, he'll take over. He's a power hungry motherfucker [22:33:13] /mode #wikimedia-operations +b *!*@mobile-198-228-225-042.mycingular.net [22:33:27] and then kick again [22:33:35] yay [22:33:51] :) [22:34:04] i thought it was the same mask as the invite exemption stuffs, but wasnt sure and limechat doesnt have ban in the kick options [22:34:07] Well my client is nice and expands my *'s for me :P [22:34:28] i have yet to find awesome graphical os x irc client [22:35:41] Snowolf: so yea i can pull that bit of cruft off the wikitech admin log [22:35:52] but the tweet and such is already out there and such [22:35:57] figured you prolly dont care [22:36:11] we get random log bot trolling once every few months. [22:36:42] It's more insulting that he thought gay or autistic was insulting :P [22:36:57] RobH: heh, I still would appreciate if you could remove it from the log anyway [22:37:01] if it happened more often we would put in some kind of whitelist for logging, but it really doesnt [22:37:10] but yea, i am pulling up wikitech to pull from there [22:37:30] Thanks [22:41:55] yea its off wikitech, twitter, and identica now [22:41:59] so all good. [22:46:35] RobH: much appreciated, sorry for the bother [22:50:30] RobH: linkinus didn't pass your tests? [22:51:34] I thought we had resolved the thing that was annoying you. 
[22:55:00] its great, but not open source and free is all [22:55:10] so doing my best with other options before i cave ;] [22:57:30] you won't be tainted by using a piece of commercial software. [22:57:36] (much) [22:57:38] RobH: MacIrssi :-) [23:13:26] Hey guys, I've been getting lots of API errors from various tools for the past half hour. The error is always "ERR_SOCKET_FAILURE, errno (98) Address already in use". [23:13:57] mutante mentioned that some of the API servers had to be restarted, but it seems strange they are still giving errors [23:18:36] Beuller? [23:18:57] Doing any specific requests? [23:19:12] all sort of things.... [23:19:39] edit's via Twinkle, requesting data via PageTriage, et.c [23:20:36] I suspect it's not giving server names? [23:21:00] Request: POST http://en.wikipedia.org/w/api.php, from 10.64.0.136 via cp1004.eqiad.wmnet (squid/2.7.STABLE9) [23:21:07] Request: POST http://en.wikipedia.org/w/api.php, from 10.64.0.136 via cp1004.eqiad.wmnet (squid/2.7.STABLE9) [23:21:18] PROBLEM - swift-container-auditor on ms-be5 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:21:28] those are the only 2 I saved [23:22:23] here's another one: [23:22:23] Request: POST http://en.wikipedia.org/w/api.php, from 10.64.0.125 via cp1004.eqiad.wmnet (squid/2.7.STABLE9) [23:23:11] all one squid [23:25:39] RECOVERY - swift-container-auditor on ms-be5 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [23:27:53] !log upgraded ganglia-monitor and gmetad from 3.1.2-2.1 to 3.3.5-2 [23:27:56] Logged the message, Master [23:48:48] I still get the error quite regularly (doing QA testing on PageTriage). Should I report this problem as an RT ticket or is someone investigating? [23:49:39] * Reedy attempts to replicate [23:50:38] http://en.wikipedia.org/wiki/Special:NewPagesFeed is a good page to test on since the interface won't load if the API request fails [23:51:38] It just says "Please wait" forever :) [23:55:19] binasher: we just wrapped up a MF deployment and need a varnish cache flush - can you hook it up? [23:55:32] sure [23:55:38] thanks dude [23:56:04] should be flushed [23:56:47] cool thanks
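(The errors reported above come back from the squid layer, note "via cp1004.eqiad.wmnet" in every message, rather than from MediaWiki itself, which is why unrelated tools all hit them at once. A small probe of the kind used while QA-testing Special:NewPagesFeed, assuming the third-party requests library; the query is just a lightweight siteinfo call and the failure test is an approximation of the error text quoted above.)

```python
import time
import requests  # third-party

API = "http://en.wikipedia.org/w/api.php"
PARAMS = {"action": "query", "meta": "siteinfo", "format": "json"}

failures = 0
for _ in range(50):
    r = requests.post(API, data=PARAMS, timeout=10)
    # Squid-level failures arrive as HTML error pages naming the proxy
    # (e.g. "via cp1004.eqiad.wmnet"), not as MediaWiki API error objects.
    if r.status_code >= 500 or "ERR_SOCKET_FAILURE" in r.text:
        failures += 1
    time.sleep(1)
print("failed %d/50 API requests" % failures)
```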