[00:04:00] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:08:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.761 seconds
[00:11:41] !log started process to delete objects that don't exist in the container listings on all swift backends
[00:11:44] Logged the message, Master
[00:36:30] New patchset: Hashar; "Apache files public on noc.wikimedia.org/conf/" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/7024
[00:36:31] New patchset: Hashar; "ignore some well known scheme and other specific files" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/7025
[00:37:04] Ryan_Lane: paravoid here are the Apache configurations from the cluster ^^^
[00:37:04] they are basically the ones already publicly available on http://noc.wikimedia.org/
[00:38:49] hashar: http://cdn.memegenerator.net/instances/400x/20168942.jpg
[00:38:57] ROFL
[00:39:16] I would prefer someone else to take the responsibility :-D
[00:39:41] moreover, some files are probably obsolete / deprecated
[00:42:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:49:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.873 seconds
[01:11:01] binasher: https://gerrit.wikimedia.org/r/7026
[01:12:52] New review: Asher; "(no comment)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/7026
[01:23:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:28:02] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[01:30:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.910 seconds
[01:32:14] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[01:59:05] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[02:04:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:11:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.196 seconds
[02:13:02] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[05:27:58] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[05:37:52] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[05:40:43] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588*
[05:46:25] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2338
[05:49:28] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[06:00:34] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours
[06:08:58] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[07:15:20] New patchset: ArielGlenn; "create tarballs of media uploaded locally and remote (to commons) per wiki" [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/7034
[07:20:03] moin apergos
[07:20:15] them tarballs will be how big?
[07:22:38] ooh, filesPerTarball = 100000
[07:23:29] well I might change that
[07:23:43] first of course I can give an arg but second I will likely have lists that have the size in 'em
[07:24:06] so I will likely have a max size arg
[07:24:34] anyways the sad thing is that I have been generating these for a few days now but there is some zfs/nfs issue that is making them go unbearably slow
[07:24:52] the guy at the site donating the space and hardware is looking into it
[07:25:26] these are created per wiki
[07:25:40] huh, you've got an offsite zfs box?
[07:25:52] well the zfs wasn't my idea
[07:26:03] it's their typical setup though and they know how to tune it
[07:26:06] (freebsd)
[07:26:32] ok... are they importing zfs from you though? or it's rsync or what?
[07:26:41] ah no
[07:26:51] rsync from a local mirror of media
[07:26:58] and we build from that
[07:26:58] also, i learned today that freebsd is in CVS. ;-( ;-(
[07:27:05] heh heh
[07:28:32] New review: ArielGlenn; "(no comment)" [operations/dumps] (ariel); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7034
[07:28:34] Change merged: ArielGlenn; [operations/dumps] (ariel) - https://gerrit.wikimedia.org/r/7034
[07:29:23] and now the copy in git is out of sync with the copy they are running but whatevs, I'll fix that up later
[07:30:00] btw, there's whitespace at the end of some lines in that commit
[07:34:16] figures
[07:34:19] someday
[07:34:22] (not today)
[07:34:29] I should go through all my code and
[07:34:35] clean up whitespace
[07:34:49] clean up various little crap that I've since learned nicer ways to do
[07:34:50] etc etc
[07:46:07] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[08:02:55] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[08:10:34] PROBLEM - Frontend Squid HTTP on amssq62 is CRITICAL: Connection refused
[08:10:34] PROBLEM - Backend Squid HTTP on amssq62 is CRITICAL: Connection refused
[08:14:46] RECOVERY - Backend Squid HTTP on amssq62 is OK: HTTP OK HTTP/1.0 200 OK - 635 bytes in 0.218 seconds
[08:14:55] RECOVERY - Frontend Squid HTTP on amssq62 is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 0.220 seconds
[08:24:58] PROBLEM - Frontend Squid HTTP on amssq31 is CRITICAL: Connection refused
[08:26:01] PROBLEM - Backend Squid HTTP on amssq31 is CRITICAL: Connection refused
[08:31:34] RECOVERY - Backend Squid HTTP on amssq31 is OK: HTTP OK HTTP/1.0 200 OK - 27577 bytes in 0.659 seconds
[08:32:10] RECOVERY - Frontend Squid HTTP on amssq31 is OK: HTTP OK HTTP/1.0 200 OK - 27735 bytes in 0.439 seconds
[08:42:47] PROBLEM - Host ms1004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:56:35] PROBLEM - Frontend Squid HTTP on amssq61 is CRITICAL: Connection refused
[08:57:20] PROBLEM - Backend Squid HTTP on amssq61 is CRITICAL: Connection refused
[09:00:20] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[09:04:32] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
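(A rough sketch of the per-wiki media tarball chunking apergos describes above at [07:22:38]-[07:24:06]. Only filesPerTarball = 100000 and the planned max-size argument come from the conversation; the list file names, the batch naming, and the 500 MB cap are assumptions, and this is not the actual operations/dumps code.)

    # split a per-wiki media file list into fixed-count batches and tar each batch
    split -l 100000 filelist.txt batch_
    for list in batch_*; do
        tar -cf "$list.tar" --files-from="$list"
    done

    # hypothetical max-size variant, assuming a "bytes<TAB>filename" list:
    # start a new batch whenever adding the next file would exceed the cap
    awk -v max=$((500*1024*1024)) \
        'BEGIN { n = 0 } { if (total + $1 > max) { n++; total = 0 } total += $1; print $2 > ("batch_" n) }' \
        sized-filelist.txt

Tarring from pre-chunked lists keeps each archive bounded without walking the media tree twice; the size-capped variant only works once the lists carry per-file sizes, as mentioned above.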
[09:17:08] RECOVERY - Backend Squid HTTP on amssq61 is OK: HTTP OK HTTP/1.0 200 OK - 632 bytes in 0.218 seconds
[09:17:53] RECOVERY - Frontend Squid HTTP on amssq61 is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 0.218 seconds
[09:19:50] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[09:22:05] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[09:24:47] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[09:24:48] !log started container-auditor on ms-be3 and 4
[09:24:52] Logged the message, Master
[09:25:32] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[09:35:38] PROBLEM - Backend Squid HTTP on amssq41 is CRITICAL: Connection refused
[09:35:56] PROBLEM - Frontend Squid HTTP on amssq41 is CRITICAL: Connection refused
[09:52:35] PROBLEM - Backend Squid HTTP on amssq60 is CRITICAL: Connection refused
[09:53:20] PROBLEM - Frontend Squid HTTP on amssq60 is CRITICAL: Connection refused
[09:55:26] RECOVERY - Backend Squid HTTP on amssq41 is OK: HTTP OK HTTP/1.0 200 OK - 27566 bytes in 0.658 seconds
[09:55:35] RECOVERY - Frontend Squid HTTP on amssq41 is OK: HTTP OK HTTP/1.0 200 OK - 27737 bytes in 0.438 seconds
[09:56:11] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[10:04:44] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[10:15:14] RECOVERY - Backend Squid HTTP on amssq60 is OK: HTTP OK HTTP/1.0 200 OK - 633 bytes in 0.219 seconds
[10:15:50] RECOVERY - Frontend Squid HTTP on amssq60 is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 0.219 seconds
[10:26:04] New patchset: Dzahn; "Apache files public on noc.wikimedia.org/conf/ (removed whitespace on import)" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/7024
[10:27:25] New review: Dzahn; "+1 for putting them in git. removed the whites space marked by gerrit in patch set2 to cleanup on in..." [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/7024
[10:29:16] New review: Dzahn; "+1 is all i am technically allowed on this project." [operations/apache-config] (master) C: 1; - https://gerrit.wikimedia.org/r/7024
[10:34:00] New review: Dzahn; "did not mean to break your dependency, by uploading patch set 2 on the other one, just meant to help..." [operations/apache-config] (master) C: 1; - https://gerrit.wikimedia.org/r/7025
[10:42:22] PROBLEM - Backend Squid HTTP on amssq42 is CRITICAL: Connection refused
[10:42:22] PROBLEM - Frontend Squid HTTP on amssq42 is CRITICAL: Connection refused
[11:04:52] RECOVERY - Frontend Squid HTTP on amssq42 is OK: HTTP OK HTTP/1.0 200 OK - 27737 bytes in 0.438 seconds
[11:04:52] RECOVERY - Backend Squid HTTP on amssq42 is OK: HTTP OK HTTP/1.0 200 OK - 27575 bytes in 0.656 seconds
[11:18:13] PROBLEM - swift-container-auditor on ms-be3 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[11:23:46] RECOVERY - swift-container-auditor on ms-be3 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[11:43:20] PROBLEM - Host amssq43 is DOWN: PING CRITICAL - Packet loss = 100%
[11:44:50] RECOVERY - Host amssq43 is UP: PING OK - Packet loss = 0%, RTA = 108.84 ms
[11:48:34] PROBLEM - Backend Squid HTTP on amssq43 is CRITICAL: Connection refused
[11:48:34] PROBLEM - Frontend Squid HTTP on amssq43 is CRITICAL: Connection refused
[11:57:53] RECOVERY - Backend Squid HTTP on amssq43 is OK: HTTP OK HTTP/1.0 200 OK - 27577 bytes in 0.655 seconds
[11:57:53] RECOVERY - Frontend Squid HTTP on amssq43 is OK: HTTP OK HTTP/1.0 200 OK - 27737 bytes in 0.436 seconds
[12:17:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:18:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.029 seconds
[12:46:24] PROBLEM - Lighttpd HTTP on dataset2 is CRITICAL: Connection refused
[12:51:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:51:43] grr
[12:52:33] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.007 seconds
[12:53:18] RECOVERY - Lighttpd HTTP on dataset2 is OK: HTTP OK HTTP/1.0 200 OK - 5326 bytes in 0.006 seconds
[13:09:57] New patchset: Hydriz; "Small fix: Changing the year from 1012 to 2012. Small typo in the puppet file." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7053
[13:10:16] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7053
[13:12:04] New review: Dzahn; "I believe that it wasn't in 1012 :)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7053
[13:12:06] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7053
[13:21:39] PROBLEM - Frontend Squid HTTP on amssq44 is CRITICAL: Connection refused
[13:21:39] PROBLEM - Backend Squid HTTP on amssq44 is CRITICAL: Connection refused
[13:26:54] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:32:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.868 seconds
[13:33:21] RECOVERY - Frontend Squid HTTP on amssq44 is OK: HTTP OK HTTP/1.0 200 OK - 27735 bytes in 0.437 seconds
[13:33:21] RECOVERY - Backend Squid HTTP on amssq44 is OK: HTTP OK HTTP/1.0 200 OK - 27577 bytes in 0.654 seconds
[13:46:42] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[14:03:12] goooood morning! (us east coast!)
[14:03:22] PROBLEM - swift-container-auditor on ms-be1 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[14:03:33] got a change waiting for review since yesterday, can I get a helper?
[14:03:34] https://gerrit.wikimedia.org/r/#/c/6923/
[14:03:40] pleeeaaase?
[14:03:42] :)
[14:05:10] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:06:13] RECOVERY - swift-container-auditor on ms-be1 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[14:06:29] !log started container-auditor on ms-be1
[14:06:34] Logged the message, Master
[14:06:44] ottomata: if you are not expecting me to check those IPs to match providers, i think yea
[14:07:01] yeah, i had diederik check that
[14:08:09] New review: Dzahn; "diederik checked IPs, looks reasonable" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/6923
[14:08:12] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6923
[14:09:57] ottomata: done on puppetmaster
[14:10:18] danke!
[14:10:29] bitte
[14:10:30] running puppet on oxygen now to verify...
[14:11:26] LOL
[14:12:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.598 seconds
[14:14:19] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[14:17:20] PROBLEM - udp2log log age for oxygen on oxygen is CRITICAL: CRITICAL: log files /a/squid/saudi-telecom.log, have not been written to in 24 hours
[14:18:04] PROBLEM - udp2log processes for oxygen on oxygen is CRITICAL: CRITICAL: filters absent: /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/bin/udp-filter, /usr/local/bin/packet-loss,
[14:18:40] RECOVERY - udp2log log age for oxygen on oxygen is OK: OK: all log files active
[14:19:25] RECOVERY - udp2log processes for oxygen on oxygen is OK: OK: all filters present
[14:20:55] those are ok
[14:21:07] we just deployed new filters on oxygen, was restarting udp2log and checking them out
[14:21:08] looking good
[14:23:04] recovery looks good, k
[14:44:01] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[14:46:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:48:22] PROBLEM - Backend Squid HTTP on amssq45 is CRITICAL: Connection refused
[14:48:31] PROBLEM - Frontend Squid HTTP on amssq45 is CRITICAL: Connection refused
[14:52:38] those are also ok, still rebooting
[14:54:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.020 seconds
[14:55:16] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[15:09:34] RECOVERY - Backend Squid HTTP on amssq45 is OK: HTTP OK HTTP/1.0 200 OK - 27575 bytes in 0.436 seconds
[15:09:52] RECOVERY - Frontend Squid HTTP on amssq45 is OK: HTTP OK HTTP/1.0 200 OK - 27734 bytes in 0.438 seconds
[15:27:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:36:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds
[15:43:28] PROBLEM - swift-container-auditor on ms-be4 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[15:43:28] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[15:52:28] PROBLEM - Backend Squid HTTP on amssq46 is CRITICAL: Connection refused
[15:52:46] PROBLEM - Frontend Squid HTTP on amssq46 is CRITICAL: Connection refused
[16:01:28] PROBLEM - Puppet freshness on storage3 is CRITICAL: Puppet has not run in the last 10 hours
[16:05:04] RECOVERY - Backend Squid HTTP on amssq46 is OK: HTTP OK HTTP/1.0 200 OK - 27576 bytes in 0.656 seconds
[16:05:31] RECOVERY - Frontend Squid HTTP on amssq46 is OK: HTTP OK HTTP/1.0 200 OK - 27735 bytes in 0.438 seconds
[16:06:15] mark: ping
[16:07:56] Hello, is there anybody around that can push a Varnish VCL change for me?
[16:08:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:10:24] robh:
[16:10:28] !rt 2857
[16:10:28] https://rt.wikimedia.org/Ticket/Display.html?id=2857
[16:10:42] can we decom?
[16:11:05] don't we have spare memory from other decoms?
[16:11:18] I'd rather not kill off the old db server until we have a bunch of new ones
[16:11:37] if we can repair it to last a few months, it would be ideal, as long as it's a small time investment
[16:11:57] so i advise running memtest on it to find out which dimm is bad, then replace that dimm with on-site spares.
[16:12:10] got it!
[16:12:27] im updating tickets with the above as well
[16:16:05] cmjohnson1: Ok, ticket updated with memory notes on swap
[16:16:15] Recall that with memory, the replacement needs to be the same speed or faster, so check the speed rating on the memory in the system, and the spare memory to replace. As long as the spare memory is as fast as, or faster than, the memory in the system, we are ok.
[16:16:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.906 seconds
[16:16:56] if the spare memory isn't fast enough, we may throw a few bucks at this for a single 8gb stick of memory. It is pretty damned cheap these days.
[16:17:31] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[16:18:02] RECOVERY - swift-container-auditor on ms-be4 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[16:19:13] robh: sounds good...i will get to it a little later...after the 720's
[16:19:21] cool
[16:25:07] New patchset: Hashar; "rsyslog did not reload on config change" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/6593
[16:25:27] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/6593
[16:32:17] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[16:43:36] LeslieCarr: ping
[16:44:14] pong
[16:44:19] LeslieCarr: do you mind pushing this https://gerrit.wikimedia.org/r/#/c/7026/ live for Partner IP Live testing?
[16:44:39] LeslieCarr: I'll ask Dan Foy to buy you something for all your help
[16:44:43] you didn't give me a chance to guess what you wanted ;)
[16:44:44] LeslieCarr: maybe a pony
[16:44:54] omg ponies!!!1!
[16:45:04] LeslieCarr: oh, so sorry about the lack of guessing
[16:45:54] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7026
[16:46:05] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7026
[16:46:07] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7026
[16:47:24] LeslieCarr: thanks
[16:47:30] LeslieCarr: how long until it's live?
[16:47:38] about 2 minutes i'd say
[16:47:44] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2588*
[16:50:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:50:47] preilly: there was a puppet run already in progress, got to wait for those to finish up and then repull
[16:51:05] LeslieCarr: okay thanks for the update
[16:55:11] apergos investigated some errors and I reproduced them some. we both have to go. if someone could take a look at recent #-tech history that would be great. (search for Montreal and Choctaw)
[16:57:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.246 seconds
[16:57:56] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[16:58:43] !log flushed mobile varnish cache
[16:58:46] Logged the message, Mistress of the network gear.
[16:58:48] !log restarted mobile varnish instances
[16:58:50] preilly: done
[16:58:51] Logged the message, Mistress of the network gear.
[16:58:59] PROBLEM - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is CRITICAL: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z CRITICAL - *2675*
[17:13:15] LeslieCarr: thanks!
[17:17:17] RECOVERY - ps1-b5-sdtpa-infeed-load-tower-A-phase-Z on ps1-b5-sdtpa is OK: ps1-b5-sdtpa-infeed-load-tower-A-phase-Z OK - 2375
[17:31:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:37:15] cmjohnson1/robh what's the latest on storage3?
[17:37:48] jeff_green: we are going to have to order a new controller card.
[17:38:12] i will get a procurement ticket to robh
[17:38:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.724 seconds
[17:38:23] cmjohnson1: so did dell tell you that it would fix it, and say the data would be intact?
[17:38:32] cuz if the data isn't going to be intact, we are better off just replacing the entire system
[17:39:14] I think we should do both, and both ASAP
[17:39:26] robh: we should replace the system anyway but Dell stated that replacing the card should allow us to recover the data
[17:39:27] we have no FR backups at the moment and several aspects of data collection are broken
[17:39:38] once the card arrives I will have to import the foreign config
[17:39:49] cmjohnson1: Ok, please enter a procurement ticket listing the exact card details needed
[17:40:00] i will get it quoted and ordered
[17:43:25] * jeremyb thought there was a spare card from srv217 (or something) (because the box was down anyway
[17:43:34] )
[17:46:22] jeremyb: that was the sas controller...i was looking at the wrong thing :(
[17:46:44] oh ;(
[17:47:23] square peg, round hole ;) (a tad more complex than that)
[17:51:29] !log shutting down storage3
[17:51:33] Logged the message, Master
[17:54:53] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100%
[18:11:05] !log turning db30 back on
[18:11:08] Logged the message, notpeter
[18:12:08] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:12:34] cmjohnson1: I think db30 is just dead at this point
[18:13:16] makes an awful noise and I can't get into the management interface
[18:13:19] can probably decom it
[18:13:56] PROBLEM - swift-container-auditor on ms-be2 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[18:14:29] notpeter: odd...i will take a look at it
[18:17:13] * jeremyb wonders how notpeter could possibly know how much noise it makes
[18:17:44] jeremyb: palantirs
[18:20:05] * jeremyb hasn't seen that movie
[18:20:10] * jeremyb goes back to office hrs
[18:20:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.474 seconds
[18:20:54] jeremyb: reasonable :)
[18:22:20] RECOVERY - swift-container-auditor on ms-be2 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor
[18:28:56] RECOVERY - Host storage3 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[18:32:41] maplebed: hello :)
[18:32:54] hi.
[18:34:41] AaronSchulz: something up?
[18:34:54] nope, just saying hi :p
[18:37:01] excellent.
[18:42:28] New patchset: Ottomata; "site.pp - Adding Fabian on stat1.wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7070
[18:42:47] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7070
[18:44:08] New patchset: Ottomata; "site.pp - Adding Fabian on stat1.wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7070
[18:44:28] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7070
[18:44:29] New patchset: Jgreen; "enabling fundraisingdb dumps on db1025 since storage3 is dead" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7071
[18:44:48] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7071
[18:45:23] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7071
[18:45:26] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7071
[18:50:27] !log dns update for db61 and db62
[18:50:31] Logged the message, RobH
[18:53:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:01:38] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.250 seconds
[19:15:28] robh: robh: db 61 and 62 are ready for you
[19:15:34] network ticket has been created
[19:16:07] !log updating OpenStackManager on virt0 to master
[19:16:10] Logged the message, Master
[19:16:16] cool, thx!
[19:16:22] too bad our net admins are afk.
[19:17:22] it is lunch time
[19:22:57] RobH: are those the 710's?
[19:23:03] 720s yep
[19:23:12] binasher: https://rt.wikimedia.org/Ticket/Display.html?id=2933
[19:23:18] once that is done, i can do the os install, or you can
[19:23:42] yay
[19:23:44] ok
[19:27:03] Change abandoned: Demon; "Abandoning this for now so it doesn't clutter people's review queues--this really needs the upstream..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3285
[19:28:19] LeslieCarrafk: can you take a look at https://gerrit.wikimedia.org/r/#/c/7076/
[19:34:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:39:33] !log updating OpenStackManager on virt0 to master again
[19:39:36] Logged the message, Master
[19:43:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.358 seconds
[19:45:01] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 270 seconds
[19:46:31] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 0 seconds
[19:58:52] db1025 delay is b/c it's dumping fundraising databases, no concern
[20:02:47] New review: Siebrand; "Hello? Anybody out there who can help?" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/5783
[20:09:19] RoanKattouw: ^ that mentions you
[20:09:58] Thehelpfulone: Nikerabbit already asked, I referred him to Ryan_Lane
[20:10:11] ok :)
[20:17:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:24:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.268 seconds
[20:33:02] New patchset: Jgreen; "setting up fundraising banner impression log collection/compression on hume while storage3 is dead" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7110
[20:33:21] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7110
[20:33:37] New review: Jgreen; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7110
[20:33:39] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7110
[20:43:14] LeslieCarr: can you revert this morning's change?
[20:43:37] LeslieCarr: as seen in https://gerrit.wikimedia.org/r/#/c/7076/
[20:45:14] preilly: yes i can
[20:46:55] Change abandoned: Lcarr; "reverting" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7076
[20:47:35] Change restored: Lcarr; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7076
[20:47:44] New review: Lcarr; "(no comment)" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/7076
[20:47:47] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7076
[20:52:12] !log done
[20:53:33] LeslieCarr: thanks
[20:53:33] Logged the message, Mistress of the network gear.
[20:54:41] oops why did i log that
[20:54:41] heh
[21:00:36] hahaha
[21:04:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:15:04] RoanKattouw: I think you should take a look at change 5783/2
[21:15:04] RoanKattouw: I don't think this belongs in misc
[21:35:52] New review: Bhartshorne; "this change needs an RT approved by drdee and ct before it can be merged." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/7070
[21:39:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:46:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.680 seconds
[21:53:48] RoanKattouw: where's that sanitized squid config?
[21:54:46] Ryan_Lane: /home/catrope/fromfenari.tar.gz
[21:55:24] Hmph, there are *~ files and a .svn directory in that tarball :S
[22:03:51] heh
[22:05:03] sorry about that
[22:05:10] no problem
[22:19:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:28:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.812 seconds
[23:02:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:05:34] New patchset: Pyoungmeister; "removing search1-13 stuff, adding search21-36" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7126
[23:05:53] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7126
[23:10:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds
[23:13:22] New patchset: Reedy; "Point xinetd at /home/wikipedia/common/wmf-config/extdist/svn-invoker.php rather than /home/wikipedia/common/php/extensions/ExtensionDistributor/svn-invoker.php" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7127
[23:13:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7127
[23:24:32] New patchset: Pyoungmeister; "removing search1-13 stuff, adding search21-36" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7126
[23:24:50] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7126
[23:26:07] New patchset: Pyoungmeister; "removing search1-13 stuff, adding search21-36" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7126
[23:26:34] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7126
[23:27:37] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7126
[23:27:39] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7126
[23:29:54] New patchset: Ryan Lane; "Add device detection for blog" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7131
[23:30:13] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7131
[23:30:16] New review: Ryan Lane; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7131
[23:30:18] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7131
[23:30:41] New patchset: Pyoungmeister; "making search20 install precise for testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7132
[23:30:59] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/7132
[23:31:20] New patchset: Thehelpfulone; "adding bnwiki to import sources as per bug 34791" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7133
[23:31:38] New review: Pyoungmeister; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/7132
[23:31:40] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/7132
[23:32:03] RoanKattouw: can you check it for me please? ^ the bug is https://bugzilla.wikimedia.org/show_bug.cgi?id=34791
[23:33:08] Don't sign it
[23:33:23] !log taking down search20 to do precise test-install
[23:33:26] Logged the message, notpeter
[23:33:31] oh ok
[23:33:41] I saw someone else did sign it when they put it in reed
[23:33:44] Reedy*
[23:34:06] It's only needed if you're making a strange hack or similar
[23:34:08] now here's a test, I remove my signature but how do I make it a new patch instead of a new commit?
[23:34:21] git review -D 7133
[23:34:25] git commit -a --amend
[23:34:27] git review -R
[23:34:40] do I put the bug number in?
[23:34:51] bug number is fine, yeah
[23:35:21] PROBLEM - Host search20 is DOWN: PING CRITICAL - Packet loss = 100%
[23:35:22] -d not -D
[23:35:26] -D is draft I think
[23:35:29] yeah -D didn't work
[23:35:43] bah could not fetch information from gerrit
[23:36:56] FYI, this is all documented on labsconsole/mediawikiwiki
[23:37:18] errors too?
[23:37:19] Could not fetch review information from gerrit
[23:37:19] Permission denied (publickey).
[23:39:33] https://www.mediawiki.org/wiki/Git/Workflow#SSH_and_.22permission_denied_.28publickey.29.22
[23:40:53] RECOVERY - Host search20 is UP: PING OK - Packet loss = 0%, RTA = 0.96 ms
[23:41:37] it's using Pageant so I shouldn't put the ssh key in ssh-add ~/.ssh/id_rsa to make it prompt you for the passphrase for your key and add it to the active keychain should I?
[23:41:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:42:38] no
[23:44:11] PROBLEM - SSH on search20 is CRITICAL: Connection refused
[23:46:52] then I'm stumped, adding a hook fixed it last time - is there a hook for this RoanKattouw?
[23:48:04] There's a reason that even though I use windows on my machines, I do git stuff in a ubuntu vm ;)
[23:48:09] :P
[23:48:10] http://pastebin.com/A2gVaW9T
[23:48:13] that's the full log
[23:48:50] PROBLEM - Lucene on search20 is CRITICAL: Connection refused
[23:48:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.101 seconds
[23:49:50] I seem to recall having something like that with drafts
[23:49:56] Downloading refs/changes/33/7133/1 from gerrit into review/thehelpfulone/bug/34791
[23:49:57] Switched to branch 'review/thehelpfulone/bug/34791'
[23:50:14] is that in a log somewhere?
[23:50:22] is what?
[23:50:36] where did you get that from? :)
[23:50:40] what you just pasted
[23:50:56] what I got from doing git review -d 7133
[23:51:00] or part of, at least
[23:52:12] I'm clueless as to what's blocking it though
[23:52:20] hmm
[23:52:23] git checkout master
[23:52:25] then try it again
[23:52:34] already on
[23:52:37] ahead by 4 commits
[23:52:51] git pull
[23:53:10] ok
[23:54:32] now git review -d 7133?
[23:54:47] yeah
[23:55:05] same error :(
[23:56:09] aargh it depends on ssh too
[23:56:33] New patchset: Reedy; "adding bnwiki to import sources as per bug 34791" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/7133
[23:57:14] that helped.. :P
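(For reference, the amend-and-resubmit sequence worked out above at [23:34:21]-[23:54:32], collected in one place. The commands are the ones given in the channel; the ssh-agent step assumes the key Gerrit knows about lives at ~/.ssh/id_rsa, and the full workflow is documented at https://www.mediawiki.org/wiki/Git/Workflow.)

    # make sure the agent holds the key Gerrit expects, otherwise git-review
    # fails with "Permission denied (publickey)"
    ssh-add ~/.ssh/id_rsa

    # check out the existing change locally (note: lowercase -d, not -D)
    git review -d 7133

    # edit, then amend the same commit so its Change-Id is preserved
    git commit -a --amend

    # resubmit without rebasing; the unchanged Change-Id makes Gerrit record
    # it as a new patchset on the same change rather than a new change
    git review -R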