[00:08:27] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[00:08:27] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[00:08:27] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours
[00:57:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:59:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.126 seconds
[01:24:30] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[01:24:30] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[01:33:57] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:37:06] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.971 seconds
[01:41:27] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[01:41:27] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[02:12:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:24:44] !log LocalisationUpdate completed (1.21wmf5) at Mon Dec 10 02:24:44 UTC 2012
[02:24:53] Logged the message, Master
[02:25:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.703 seconds
[02:59:31] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[03:05:22] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[03:15:49] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.040 second response time
[03:44:10] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours
[04:25:10] New patchset: Ori.livneh; "Yet another pmpta -> pmtpa" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37378
[04:36:55] basile: blog.wm.o has non-secure resources when fetched by HTTPS (at least a few images)
[05:52:38] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[06:00:35] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours
[07:44:03] PROBLEM - Puppet freshness on ms-be3002 is CRITICAL: Puppet has not run in the last 10 hours
[08:45:11] hello
[09:02:35] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused
[09:13:50] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.007 second response time on port 11000
[09:32:28] New patchset: Stefan.petrea; "Added parsing modules for wikistats testing" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37800
[09:34:59] New review: Hashar; "Seems fine to me :)" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/37800
[09:36:22] hashar: ping
[09:36:29] yeah there :-]
[09:36:56] apergos: Hello 8) would you have time to get a few packages installed on the cont int server please ? The change is https://gerrit.wikimedia.org/r/37800
[09:37:15] average_drifter: it is quiet during eu morning but we have a few ops floating around to assist us nonetheless
[09:37:25] depending how much they are busy with other projects though
[09:39:42] im just looking at bug 42860 somehow some webm files have Content-Type: text/plain
[09:40:04] anyone knows what layer this could go wrong?
[09:40:22] example url: https://upload.wikimedia.org/wikipedia/commons/thumb/a/a2/Le_Voyage_dans_la_Lune_%28Georges_M%C3%A9li%C3%A8s%2C_1902%29.ogv/Le_Voyage_dans_la_Lune_%28Georges_M%C3%A9li%C3%A8s%2C_1902%29.ogv.480p.webm
[09:40:33] !b 42860
[09:40:33] https://bugzilla.wikimedia.org/42860
[09:40:43] sometimes i get text/plain sometimes an ok webm type
[09:41:37] j^: I guess the content type is set by the mediawiki extension isn't it ?
[09:41:45] could be that some servers do not properly set it
[09:41:46] or
[09:42:04] its set while putting it into swift yes
[09:42:04] the file might be cached on different cache and one of the copy has a wrong content type
[09:42:33] since sometimes it returns ok i am wondering if some other part of the cache chain also sets it
[09:43:41] hashar: shall we ask for second opinion on https://gerrit.wikimedia.org/r/#/c/37800/ ?
[09:44:21] j^: I seem to always get it with text/plain though if I append a query parameter ( url/.…webm?ohihatecache ) I get audio/webm
[09:44:41] j^: so it looks like to me the cache has the wrong info
[09:45:12] average_drifter: I can't merge changes in operations/puppet.git , only roots can.
[09:45:21] average_drifter: so we need someone from ops to look at it :-]
[09:45:23] j^:
[09:45:52] hashar: alright :)
[09:46:39] hashar: i guess as soon as https://gerrit.wikimedia.org/r/#/c/35574/ is in production it should be easier to reset those caches
[09:50:07] j^: I have updated the bug report listing the curl command and the headers
[09:50:20] j^: nothing I can do about it, I have no idea how to purge that from the upload caches
[10:00:42] j^: that bug was fixed a while ago
[10:00:45] but some cache content may still have it
[10:00:54] !g I38bb589f10ab3253472fd5b5fbb0a19b80b4d9e1
[10:00:54] https://gerrit.wikimedia.org/r/#q,I38bb589f10ab3253472fd5b5fbb0a19b80b4d9e1,n,z
[10:02:17] mark: any way to purge the cache if a file has been identified?
[10:02:38] this is only happening in esams right?
[10:03:38] yes
[10:03:42] only in esams
[10:03:59] and mediawiki doesn't seem to purge those
[10:04:42] i changed that, https://gerrit.wikimedia.org/r/#/c/35574/
[10:04:45] that was the discussion about purging vs removing the scaled files, remember
[10:04:47] but its not deployed
[10:04:48] ah
[10:06:09] so in the future it should be possible to fix sending ?action=purge
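
The check described above — the same thumbnail returning text/plain from the cache but audio/webm once a cache-busting query parameter is appended — is easy to script. A minimal sketch in Python (the query parameter name is arbitrary; any query string the cache has not seen will bypass the cached copy):

```python
# Compare the cached object's Content-Type against a cache-busted
# fetch of the same object, as described above. The parameter name
# "nocache" is arbitrary; any unseen query string bypasses the cache.
import urllib.request

URL = ("https://upload.wikimedia.org/wikipedia/commons/thumb/a/a2/"
       "Le_Voyage_dans_la_Lune_%28Georges_M%C3%A9li%C3%A8s%2C_1902%29.ogv/"
       "Le_Voyage_dans_la_Lune_%28Georges_M%C3%A9li%C3%A8s%2C_1902%29.ogv.480p.webm")

def content_type(url):
    req = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(req) as resp:
        return resp.headers.get("Content-Type")

print("as cached:   ", content_type(URL))
print("cache-busted:", content_type(URL + "?nocache=1"))
```

If the two lines differ, the stale copy lives in the cache layer rather than in swift, which is consistent with the esams-only behaviour noted above.
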
[10:09:16] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[10:09:16] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[10:09:16] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours
[10:14:12] hashar: this link you gave me, it talks about wikistat testing, that's the one you want merged?
[10:15:17] New review: Siebrand; "The problem here is "shellpolicy"? What does that mean?" [operations/mediawiki-config] (master) C: 0; - https://gerrit.wikimedia.org/r/31823
[10:20:36] apergos: yup
[10:20:59] apergos: https://gerrit.wikimedia.org/r/37800 the perl modules are required by the analytics team
[10:21:05] apergos: their tests are written in perl :-]
[10:21:24] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37800
[10:21:42] you just linked me the same change, did you mean to?
[10:21:51] yup
[10:22:01] so you don't have to look it up in case you closed your browser :-]
[10:22:11] oh. no I had the tab open
[10:22:13] thanks though
[10:22:21] average_drifter: Ariel merged the change that get the new perl packages on gallium !
[10:22:28] tabs are wonderful things
[10:22:50] wait for it please
[10:23:03] I still need to merge on sockpuppet (just done) and run on gallium
[10:26:42] hm
[10:26:57] ah there it is
[10:27:06] your change is live, have fun with the perl scripts
[10:27:13] apergos: thanks!!!
[10:27:24] I'm going to be afk for a bit, (visit to lawyer), back in a little while
[10:48:52] PROBLEM - Host ms-be3002 is DOWN: PING CRITICAL - Packet loss = 100%
[11:06:25] RECOVERY - Host ms-be3002 is UP: PING OK - Packet loss = 0%, RTA = 109.32 ms
[11:21:43] RECOVERY - SSH on ms-be3002 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[11:25:35] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[11:25:35] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[11:42:14] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[11:42:14] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[12:02:16] New patchset: MaxSem; "Temporarily raise $wgMaxCoordinatesPerPage" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37823
[12:10:55] Change merged: MaxSem; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37823
[12:13:34] !log maxsem synchronized wmf-config/CommonSettings.php 'https://gerrit.wikimedia.org/r/#/c/37823/'
[12:13:44] Logged the message, Master
[13:00:10] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[13:04:35] New review: Dereckson; "Shellpolicy issue has been mitigated (cf. bug 41757, comments 8 and 9)." [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/31823
[13:06:46] New patchset: Dereckson; "(bug 42737) Rights configuration on ur.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37411
[13:15:47] jeremyb: I think I've fixed all of the theme's URLs to be protocol-independent, but I can't fix all the images that people insert into individual posts.
[13:16:48] guillom: well can we at least start by figuring out a way to make images correct for new posts?
[13:17:00] guillom: the problems i saw looked pretty recent
[13:18:32] * jeremyb has to run in a min
[13:19:27] jeremyb: I don't know if it's a WordPress bug (in which case we can't fix it) or if it only happens when we hotlink, e.g. from Commons (in which case the only way is to fix them manually).
[13:20:35] guillom: i do know it definitely happens with uploads to blog.wm.o
[13:20:51] the 2 that i noticed were http://blog.wm.o/...
[13:23:35] It seems that there are workarounds ( http://www.deluxeblogtips.com/2012/06/relative-urls.html ) but I don't get why this wouldn't be implemented in WP itself.
[13:23:42] Anyway, not my highest priority atm :)
[13:23:58] guillom: anyway, i guess that means no WMF person has worked on it yet. /me will click that link later
[13:24:10] bbl
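
The workaround guillom links above comes down to making embedded URLs protocol-relative, so the same post renders cleanly over both HTTP and HTTPS. A minimal sketch of that rewrite, assuming post bodies are available as HTML strings (the sample image URL is illustrative, not a real upload):

```python
# Rewrite hard-coded http:// or https:// src/href attributes to
# scheme-relative // URLs, the workaround discussed above.
import re

ATTR_URL = re.compile(r'\b(src|href)=(["\'])https?://', re.IGNORECASE)

def make_protocol_relative(html):
    return ATTR_URL.sub(r'\1=\2//', html)

post = '<img src="http://blog.wikimedia.org/files/example.png">'
print(make_protocol_relative(post))
# -> <img src="//blog.wikimedia.org/files/example.png">
```
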
[13:45:50] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours
[14:22:46] apergos , hashar thanks !
[14:23:26] yw (mostly hashar)
[14:24:31] team work ! ;-D
[14:25:10] :-)
[14:47:20] RECOVERY - Puppet freshness on ms-be3002 is OK: puppet ran at Mon Dec 10 14:47:02 UTC 2012
[14:49:43] hi guys
[14:49:50] what's up with Gerrit ?
[14:49:55] I need to do a git review but I can't
[14:50:03] I get
[14:50:18] error: The requested URL returned error: 403 while accessing https://gerrit.wikimedia.org/r/p/analytics/wikistats.git/info/refs
[14:54:24] Hello ?
[14:54:34] I can't push my changes in Gerrit
[14:54:38] average_drifter: i think san fransisco's asleep :)
[14:54:39] Any changes
[14:54:45] jeremyb: ok
[14:56:17] what can I do ?
[15:00:04] ok got it working
[15:07:53] RECOVERY - NTP on ms-be3002 is OK: NTP OK: Offset -0.02130961418 secs
[15:53:32] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[16:01:37] PROBLEM - Puppet freshness on ssl3001 is CRITICAL: Puppet has not run in the last 10 hours
[16:43:11] New patchset: Jgreen; "set tmp suffix on in-progress files so we can ignore them for rsync" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37851
[16:44:04] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37851
[17:22:43] New patchset: Demon; "Hook for Special:Version" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37854
[17:23:21] New patchset: Demon; "Hook for Special:Version" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37854
[17:47:36] PROBLEM - Host constable is DOWN: CRITICAL - Host Unreachable (208.80.152.151)
[17:49:24] RECOVERY - Host constable is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[17:58:08] !log demon synchronized wmf-config/extdist/svn-invoker.conf 'More futile efforts to fix extdist'
[17:58:17] Logged the message, Master
[18:12:48] can someone please review https://gerrit.wikimedia.org/r/#/c/35298/
[18:25:13] !log temp. depooling ssl4
[18:25:20] Logged the message, Master
[18:34:23] did you want to add IEMobile to that preg match, MaxSem? I see it's in devicedetection, but maybe you already test for 'mobile' in the string someplace/
[18:34:26] ?
[18:34:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:36:25] RoanKattouw_away: where you at?
[18:36:32] I found our issue with git-deploy
[18:36:34] apergos, I ve removed it in the second patchset as it was optimised out of DeviceDetection too
[18:36:44] ok
[18:37:30] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35298
[18:38:13] tyvm apergos
[18:38:45] what hosts need a puppet run now?
[18:39:30] MaxSem:
[18:39:41] varnish, but it's not urgent so a usual puppet run would suffice
[18:39:53] ok
[18:42:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.270 seconds
[18:43:41] apergos, can you also take a look at https://gerrit.wikimedia.org/r/#/c/35931/ ?:)
[18:43:51] in just a sec, yes
[18:48:23] heya, LeslieCarr or other networking knowledgeables
[18:48:35] whats up
[18:48:40] got a sec to help me understand linux udp packet loss re udp2log?
[18:49:03] sure, though the packet loss stat isn't actually packet loss
[18:49:07] it's loss in the log processing
[18:49:23] right
[18:49:24] right, well
[18:49:32] except, i see drops in /proc/net/udp
[18:49:35] or
[18:49:51] i see, that just means that udp2log isn't able to remove items from the buffer fast enough?
[18:49:52] oh? increasing ?
[18:49:53] yes
[18:49:58] this is on analytics machines, btw
[18:50:02] not the main udp2log machines
[18:50:05] those all run unsampled
[18:50:16] i'm trying to do stuff with unsampled logs
[18:50:17] ah
[18:50:33] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37854
[18:50:53] which machines ?
[18:51:05] i can check on the interfaces to see if there's drops there
[18:51:06] just in case
[18:51:26] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/35931
[18:51:35] !log demon synchronized wmf-config/CommonSettings.php 'New link for Special:Version'
[18:51:42] welp, i've got 2 i'm playing with right now,
[18:51:43] Logged the message, Master
[18:51:51] analytics1026 is the one that is easier to play with for debugging
[18:51:57] apergos, you're my hero!
[18:52:12] for at least the next 5 minutes :-D
[18:52:13] ok
[18:52:14] but, I think this is application related somehow, leslieCarr
[18:52:15] because
[18:52:31] I know how it goes here: "what has ops done for us lately?" ;-)
[18:52:32] i'm using Tim's packet-loss log code to print out a packet loss report
[18:52:44] that code just looks at hostnames and seq numbers in the log stream
[18:52:52] and then prints out a report if it seems missing seqs
[18:53:21] so, when I run the unsampled log through packet-loss log with no other udp2log processes
[18:53:28] packet loss is close to nothing, or 0
[18:53:54] but, if I add another unsampled process (this one is using udp-filter and then sending data out to another machine)
[18:54:19] all the sudden I get lots of dropped packets, and packet loss log reports ~60% loss
[18:54:47] well the interface itself is getting very little traffic with no lost packets
[18:54:57] on analytics1026?
[18:54:59] how can you tell?
[18:55:07] tcpdump?
[18:55:26] ifconfig?
[18:56:11] the switch information
[18:56:17] oh
[18:56:26] its getting very little traffic?
[18:56:35] yeah
[18:56:40] dstat --net shows ~50MB recv / sec
[18:56:54] (this is multicast, btw)
[18:57:01] reading from multicast ip
[18:57:46] yeah, i'm seeing a max of 400mbit a second
[18:57:55] and 0 drops
[18:58:07] aye ok, sounds about right,
[18:58:33] ok 0 drops at switch, so that means that its all on an26, right?
[18:58:35] MaxSem: your change is not live on yttrium
[18:58:40] yep
[18:58:43] ok, ja
[18:58:48] *now
[18:59:26] that's what I've figured so far, because if I run only the packet-loss filter to measure loss at the log line level, i get very little loss
[18:59:35] but if I run another process, I start getting lots
[18:59:43] and i'd like to understand why
[18:59:47] so
[18:59:58] from netstat -u
[19:00:00] netstat -su
[19:00:01] 2377440535 packet receive errors
[19:00:10] /proc/net/udp says
[19:00:13] drops
[19:00:17] 13195408
[19:00:22] and that keeps increasing
[19:00:59] now that is a good question that i don't have the answer to yet
[19:01:01] i'm suspecting that there is some buffer issue with multiple unsampled processes
[19:01:21] we have some production machines that are running unsampled processes, but I think that there is only one when that happens
[19:01:22] yeah
[19:01:25] there were also changes to /etc/solr/conf/solrconfig.xml and /etc/default/jetty
[19:01:43] and I don't understand all of the network buffers that the nic or the kernel have in between the application, especially in context of udp2log, since it forks a process for each defined filter
[19:02:00] hrm, is there a local buffer in the application ?
[19:02:05] yes i think so
[19:02:10] // Process received packets
[19:02:11] const size_t bufSize = 65536;
[19:02:11] ?
[19:02:19] yeah
[19:02:19] bytesRead = socket.Recv(receiveBuffer, bufSize);
[19:03:29] hrm, want to try increasing that ?
[19:03:42] ergh, would have to recompile it, but i could
[19:04:07] hm, before I do lemme see if I can understand how sockets and buffers work with forked processes like this
[19:04:19] apergos, that's ok - these changes are made by the solr class
[19:04:26] ok great
[19:04:27] ok
[19:04:52] also, this stackoverflow question may be relevant - http://stackoverflow.com/questions/6627702/multicast-packet-loss-running-two-instances-of-the-same-application
[19:05:24] (stackoverflow -- making sysadmins seem like magical fountains of knowledge for years!)
[19:05:43] haha, i've been googling all morning, haven't read this one yet...
[19:05:52] read many a stack overflow posts, hehe
[19:11:00] !log fixing test-star vs. old-star and new-star WP certificate/key confusion on ssl hosts.. arrg
[19:11:08] Logged the message, Master
[19:17:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:22:37] New patchset: Jgreen; "redo for new environment" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37864
[19:24:40] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37864
[19:28:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.649 seconds
[19:32:21] ok, LeslieCarr, as far as I can tell from reading udp2log code
[19:32:38] there is a single process that reads from the socket, and writes to an application buffer
[19:33:09] and epoll is used to notify each process that they need to read from the app buffer
[19:33:30] so, it seems like this is a different problem than the stack overflow link you sent
[19:33:48] <^demon> Just started scap. Hang on to your pants everybody.
[19:33:55] the OP there was trying to read from the same multicast ip in different processes
[19:34:05] udp2log should only use a single process to read off the network socket
[19:38:08] !log repooling ssl4
[19:38:18] Logged the message, Master
[19:39:53] ottomata: I'm catching up on the backlog here. what's the new line you're adding to the config?
[19:40:21] ...and which machine are you debugging this on?
[19:40:22] oh? i'm not adding new lines
[19:40:26] this is on analytics machines
[19:40:30] not on any of the prod instanes
[19:40:32] instances
[19:40:42] !log restarting nginx on all ssl hosts
[19:40:46] i'm dealing with unsampled udp2logs, trying to understand why I can't process them without dropping packets
[19:40:50] Logged the message, Master
[19:41:13] this is happening in two cases for me right now
[19:41:23] the easiest to debug is analytics1026
[19:41:39] i'm using an1026 to help drdee and average_drifter test webstatscollector changes
[19:41:50] webstatscollector operates on unsampled udp2logs
[19:41:55] https://en.wikipedia.org isn't working for me. Known issue?
[19:42:09] I get "Firefox can't establish a connection to the server at en.wikipedia.org."
[19:42:43] so i'm trying to get them an unsampled stream that I can pipe through filter (now running as udp-filter -o), and then pipe that over to stat1 so stefan can test collector results and make sure they work
[19:42:57] but, whenever I try to use udp2log with an unsampled pipe process
[19:43:18] i start losing packets. the packet-loss log reports around 60% loss
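
The packet-loss report referred to here was described earlier in the channel: it watches hostnames and sequence numbers in the log stream and flags gaps. The accounting amounts to roughly the following (a sketch only; hostname in field 0 and sequence number in field 1 is an assumption, not necessarily the production log format):

```python
# Per-host sequence-number accounting of the kind described above:
# remember the last seq seen per host and count any gap as loss.
import sys
from collections import defaultdict

last_seq = {}
received = defaultdict(int)
lost = defaultdict(int)

for line in sys.stdin:
    fields = line.split()
    if len(fields) < 2 or not fields[1].isdigit():
        continue
    host, seq = fields[0], int(fields[1])
    received[host] += 1
    prev = last_seq.get(host)
    if prev is not None and seq > prev + 1:
        lost[host] += seq - prev - 1
    last_seq[host] = seq

for host in sorted(received):
    total = received[host] + lost[host]
    print("%s: %.1f%% lost" % (host, 100.0 * lost[host] / total))
```
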
[19:43:43] Marybelle: working for me as well, at least on Chrome. Trying Firefox in a sec
[19:43:45] ^demon: Hmm, it was working for me and then suddenly stopped.
[19:43:54] Other sites seem to load fine...
[19:44:11] robla: we also need the unsampled streams to start supporting funnel analysis for the different product teams
[19:44:36] PROBLEM - HTTPS on ssl1004 is CRITICAL: Connection refused
[19:44:50] <^demon> I wish downforeveryoneorjustme did ssl.
[19:45:37] Firefox working for me as well
[19:46:06] i think it was related to this: mutante: !log restarting nginx on all ssl hosts
[19:46:11] It might be an issue with XO...
[19:46:55] Another report that https isn't working, but http is.
[19:47:05] I'm able to reproduce. http is fine; https isn't working.
[19:47:43] Hi Terry, Sumana.
[19:47:56] I can connect to en.wp over http but not https right now -- but wikisource https works
[19:47:56] sumanah sees the problem as well
[19:48:00] (hi Marybelle)
[19:48:12] maybe HTTPS everywhere is the issue....
[19:48:15] weirdly, the person right next to me (Katie) is connecting over https and is fine
[19:48:26] robla: It's failing for me in Chrome without HTTPSEverywhere installed.
[19:48:40] It may be a provider issue and not an issue with Wikimedia.
[19:48:40] so much for that then....
[19:48:45] Though I just saw nagios complain about SSL...
[19:48:45] well, I tried just manually in epiphany and without HTTPS Everywhere and it failed
[19:49:24] ipv6 maybe?
[19:49:27] Wiktionary works with & without HTTPS.
[19:49:37] It's working again now.
[19:49:46] ok, problem is fixed
[19:49:55] https://en.wikipedia.org is fine for me now
[19:49:59] Clogged tubes, I guess.
[19:50:16] RECOVERY - HTTPS on ssl1004 is OK: OK - Certificate will expire on 07/19/2016 06:51.
[19:50:38] !log replaced wikipedia SSL cert with new DigiCert (rather than RapidSSL) (RT-3639)
[19:50:46] Logged the message, Master
[19:51:37] aha, ok
[19:57:52] !log demon ran sync-common-all
[19:58:18] Logged the message, Master
[19:59:20] !log demon rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki to 1.21wmf6
[19:59:33] Logged the message, Master
[20:01:24] !log demon rebuilt wikiversions.cdb and synchronized wikiversions files: testwiki back to 1.21wmf5
[20:01:37] Logged the message, Master
[20:01:46] PHP fatal error in /home/wikipedia/common/wmf-config/CommonSettings.php line 2861:
[20:01:46] require() [function.require]: Failed opening required '/home/wikipedia/common/php-1.21wmf6/../wmf-config/ExtensionMessages-1.21wmf6.php' (include_path='/home/wikipedia/common/php-1.21wmf6/extensions/TimedMediaHandler/handlers/OggHandler/PEAR/File_Ogg:/home/wikipedia/common/php-1.21wmf6:/home/wikipedia/common/php-1.21wmf6/lib:/usr/local/lib/php:/usr/share/php')
[20:02:48] overzealous search and replace for s/wmf/wmf6/ in CommonSettings.php?
[20:03:03] <^demon> Yeah, I rolled back to wmf5.
[20:03:13] <^demon> No, we need l10n for 1.21wmf6.
[20:03:26] !log demon Started syncing Wikimedia installation... :
[20:03:32] <^demon> I forgot sync-common-all doesn't do l10n.
[20:03:33] Logged the message, Master
[20:04:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:04:46] * robla goes to grab lunch while scap is running
[20:05:34] tmh1 enwiki Error connecting to 10.0.6.73: User 'wikiadmin' has exceeded the 'max_user_connections' resource (current value: 80)
[20:05:42] well I guess it was hit then :)
[20:06:58] woo
[20:07:02] it works!
[20:07:38] !log demon Finished syncing Wikimedia installation... :
[20:08:02] Logged the message, Master
[20:10:49] PROBLEM - Puppet freshness on analytics1007 is CRITICAL: Puppet has not run in the last 10 hours
[20:10:49] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[20:10:49] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours
[20:11:02] New patchset: Asher; "returning pc1 to svc" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37874
[20:11:46] !log demon rebuilt wikiversions.cdb and synchronized wikiversions files:
[20:12:03] Logged the message, Master
[20:15:49] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37874
[20:17:34] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.172 seconds
[20:19:31] !log asher synchronized wmf-config/CommonSettings.php 're-enabling pc1 for sql: the bag o stuffening'
[20:19:46] Logged the message, Master
[20:20:18] New patchset: Jgreen; "add conf for civicrm amazon audit script" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37875
[20:20:53] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37875
[20:26:19] New patchset: Demon; "Various updates for 1.21wmf6" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37878
[20:27:29] !log demon Started syncing Wikimedia installation... :
[20:27:36] Logged the message, Master
[20:36:39] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37878
[20:38:08] New patchset: Jgreen; "typo fixes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37936
[20:39:43] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37936
[20:40:38] notpeter: around?
[20:41:14] hey
[20:44:30] notpeter: can you manually re-add the stuff at http://wikitech.wikimedia.org/view/Cron_jobs#manual_cron_jobs to hume
[20:44:37] it has the crontab code snippets
[20:44:50] right now, none of that stuff is run anymore
[20:45:06] ...probably would help to be puppetized or something ;)
[20:45:20] sure
[20:46:50] why was none of that put into puppet?
[20:47:30] AaronSchulz: all of them?
[20:47:34] including the svn shit?
[20:48:12] let's just put it all into puppet instead of any manual stuff ?
[20:48:14] please?
[20:49:09] notpeter: this is https://bugzilla.wikimedia.org/show_bug.cgi?id=42152
[20:50:57] LeslieCarr: oh, definitely
[20:51:05] but I'm trying to at least figure out what needs to go in
[20:51:31] AaronSchulz: ok. but what I'm asking is "is that page up to date?"
[20:51:44] "are all of those crons needed at present"
[20:51:55] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:52:27] notpeter: that doxygen one might be useless...should use git
[20:52:44] might want to ask Ryan_Lane about the ldap one maybe
[20:53:11] lol, zwinger
[20:53:21] notpeter: anyway the top three are fine
[20:53:43] New patchset: Alex Monk; "(bug 42921) Add upload_by_url permission to commons image-reviewers" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37939
[20:53:52] ok
[20:55:28] <^demon> The svn crap should not be on hume--that's on formey as it should be.
[20:55:46] ok
[20:56:23] <^demon> (And I *think* it's mostly puppetized, although all 3 of those are going away in the semi-near-ish-future)
[20:57:22] New patchset: Dzahn; "make mobile wikipedia use new SSL cert and kill test-star afterwards" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37940
[20:58:39] New review: Dzahn; "ok now,after manual fix to key mismatch... and about time ..would have expired soon" [operations/puppet] (production); V: 1 C: 2; - https://gerrit.wikimedia.org/r/37940
[20:58:40] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37940
[20:59:47] New patchset: Demon; "test2wiki also on 1.21wmf6" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37941
[21:00:08] Change merged: Demon; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/37941
[21:01:04] New review: Ryan Lane; "After discussion with Asher, it may not be possible to properly send a vary header for the redirects..." [operations/apache-config] (master); V: 0 C: -2; - https://gerrit.wikimedia.org/r/13293
[21:03:27] !log demon Finished syncing Wikimedia installation... :
[21:03:35] Logged the message, Master
[21:03:47] !log demon rebuilt wikiversions.cdb and synchronized wikiversions files: test2wiki also on 1.21wmf6
[21:03:58] New review: Asher; "We should move all of this rewrite logic to varnish, once varnish replaces squid. Hopefully that ca..." [operations/apache-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/13293
[21:04:04] Logged the message, Master
[21:04:58] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.259 seconds
[21:09:23] AaronSchulz: should the update flagged revs script possibly not be run as root?
[21:10:12] I don't see why it would have to be
[21:10:42] ok
[21:10:53] setting it to run as apache user, then
[21:14:08] New patchset: Lcarr; "l10nupdate user required on these hosts for deployment rt # 4052" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37942
[21:14:12] Reedy: ^^ this should fix your problem
[21:14:15] notpeter: can you close that bug now?
[21:18:27] !log demon rebuilt wikiversions.cdb and synchronized wikiversions files: wikidatawiki & mediawikiwiki to 1.21wmf6
[21:18:40] Logged the message, Master
[21:19:34] AaronSchulz: yes, I'm writing the puppetz now
[21:20:10] New patchset: Ori.livneh; "Yet another pmpta -> pmtpa" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37378
[21:20:26] \o/
[21:20:36] ^^^ ops, if you'd like a free access switch
[21:20:53] see two character change above
[21:26:19] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[21:26:19] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[21:29:40] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37378
[21:29:53] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37942
[21:30:08] thanks LeslieCarr
[21:31:32] too bad all that does is start monitoring a management aggregation switch ori-l
[21:31:40] if it actually gave us a free access switch that would rock
[21:33:26] New review: Alex Monk; "Doesn't look like I2c6ab07d is ready for deployment, see Ryan Lane's -2 comment. Can you remove your..." [operations/apache-config] (master) C: 0; - https://gerrit.wikimedia.org/r/34113
[21:35:23] LeslieCarr: i think using puppet specs to generate network equipment on a 3D-printer might be a few major puppet releases away :)
[21:38:03] New patchset: Pyoungmeister; "puppetizing the unpuppetized crons for hume in http://wikitech.wikimedia.org/view/Cron_jobs#manual_cron_jobs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37946
[21:38:58] AaronSchulz: how does that look ^
[21:40:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:42:04] LeslieCarr, I know you don't know this stuff, but I'm bugging you cause you have network brain, and I just need some bouncing
[21:42:04] so.
[21:42:08] this udp2log thing
[21:42:34] i'm not entirely clear on how network kernel buffers work
[21:43:16] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[21:43:16] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[21:43:32] New review: Siebrand; "Per comment above" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/31823
[21:44:15] please correct this summary: packets come into nic, and get copied somewhere in kernel mem space, kernel is notified that there are packets to read
[21:45:40] PROBLEM - check_minfraud_secondary on payments3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:46:00] udp2log calls read() in an infinite loop, which reads from the socket buffer (egh, not clear on this bit I think)
[21:46:38] then, udp2log writes the bytes that it read from the socket into its own buffer, which has an epoll call back attached to it, which causes each process to read from the udp2log buffer
[21:46:42] so, i'm trying to figure out
[21:47:03] how is it that running more unsampled udp2log procs causes kernel to drop more packets
[21:47:59] i guess, by running more packets, there is more to do, so the main udp2log process doesn't get enough processor time to read off of the socket fd?
[21:48:00] hm
[21:48:18] if that was true, if I niced up the main udp2log process, it should help
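
One way to test the hypothesis above is to watch whether the drops counter keeps climbing while adding or removing filter processes. A small sketch that polls the last column of /proc/net/udp for one socket (PORT is a placeholder for whatever port udp2log is actually bound to; the drops column is present on the kernels of this era):

```python
# Poll the per-socket "drops" counter in /proc/net/udp (last column;
# the local port is hex in column 1). PORT is a placeholder for the
# port udp2log is actually bound to.
import time

PORT = 8420  # placeholder

def drops_for_port(port):
    with open("/proc/net/udp") as f:
        next(f)  # skip the header line
        for line in f:
            fields = line.split()
            if int(fields[1].split(":")[1], 16) == port:
                return int(fields[-1])
    return None

prev = drops_for_port(PORT)
while True:
    time.sleep(5)
    cur = drops_for_port(PORT)
    if prev is not None and cur is not None:
        print("drops: %d (+%d in 5s)" % (cur, cur - prev))
    prev = cur
```
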
[21:50:10] RECOVERY - check_minfraud_secondary on payments3 is OK: HTTP OK: HTTP/1.1 302 Found - 120 bytes in 0.422 second response time
[21:53:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.169 seconds
[21:54:44] AaronSchulz: well, I'm going to merge, as I think it's about right. please do have a look over when you get a chance
[21:55:41] binasher, you around?
[21:56:13] New patchset: Pyoungmeister; "puppetizing the unpuppetized crons for hume in http://wikitech.wikimedia.org/view/Cron_jobs#manual_cron_jobs" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37946
[21:56:28] ottomata: i was getting food but that all sounds logical
[21:56:42] just tried renicing, no dice
[21:56:44] no nice no dice
[21:57:50] !log one more nginx restart on ssl boxes for mobile cert
[21:57:58] Logged the message, Master
[21:58:49] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37946
[22:00:57] but, LeslieCarr, just so i'm not misleading myself
[22:01:05] the reason why /proc/net/udp would show dropped packets
[22:01:21] is if the packets are not read off of the network buffer fast enough
[22:01:24] right?
[22:02:26] ottomata: is the box thread thrashing?
[22:02:40] err i mean like excessive context switching
[22:02:45] no
[22:03:08] this test one has 400MB free, but I have another box that has 192G Ram and it is doing the same thing
[22:03:15] oh
[22:03:18] hmm
[22:03:49] fwiw, at CL we ended up having to use multiple collector boxes
[22:04:44] with less traffic
[22:04:46] how'd you do that? filtered by mod seq number?
[22:04:47] aye
[22:04:48] ottomata: yes
[22:05:01] ok cool, thanks LeslieCarr
[22:05:15] Jeff_Green, don't think I can do this in this case
[22:05:21] ottomata: I wasn't super involved, but I believe they split the proxy hosts into different multicast streams
[22:05:24] Jeff_Green: correct me if i am wrong, but i believe it's that we had machines logging to different multicast ip's and the logger boxes only listened on X ip
[22:05:43] i mean, we want to eventually be able to get the sources to log directly to kafka
[22:05:50] right, if we could change where the sources sent, we could do that
[22:06:04] RECOVERY - Host ms-be1001 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms
[22:06:12] the nice thing about kafka in this case, is that it uses zookeeper to configure itself, so if you need more boxes listening you don't have to modify the source config files
[22:06:21] buuuuut, that's irrelevant to this problem
[22:07:04] i just can't get udp2log to write to multiple unsampled pipe processes
[22:07:21] well, i mean, it can, but when it does it won't read from the socket fast enough
[22:08:28] !log mobile and non-mobile Wikipedia now use just ONE certificate.. a new "star.wp" with an ".m." in it, valid until 2016 by DigiCert
[22:08:36] Logged the message, Master
[22:08:42] ottomata: so here's a thought, which i may be wrong on
[22:09:03] what about if you split up udp2log into two processes - 1 just spits out the log into some sort of temp file
[22:09:17] the second then does the analysis and creates the actual files we want
[22:09:43] well, in this case I'm not even creating any files
[22:09:58] PROBLEM - swift-object-server on ms-be1001 is CRITICAL: Connection refused by host
[22:09:58] PROBLEM - swift-container-replicator on ms-be1001 is CRITICAL: Connection refused by host
[22:09:58] PROBLEM - swift-object-replicator on ms-be1001 is CRITICAL: Connection refused by host
[22:09:58] PROBLEM - swift-account-reaper on ms-be1001 is CRITICAL: Connection refused by host
[22:10:11] this is sending the data out on another socket, buuuut, you are right, maybe I should experiment with other proc types (like writing to a file)
[22:10:16] PROBLEM - swift-account-server on ms-be1001 is CRITICAL: Connection refused by host
[22:10:16] PROBLEM - swift-container-updater on ms-be1001 is CRITICAL: Connection refused by host
[22:10:16] PROBLEM - SSH on ms-be1001 is CRITICAL: Connection refused
[22:10:16] PROBLEM - swift-object-updater on ms-be1001 is CRITICAL: Connection refused by host
[22:10:17] to see if the same thing happens
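
The stream-splitting idea above depends on each collector joining only its own multicast group. Together with the receive-buffer question from earlier, a minimal listener might look like this (GROUP and PORT are hypothetical values, and the kernel caps the requested buffer at net.core.rmem_max):

```python
# Minimal multicast listener: join one specific group, so different
# collector boxes can each subscribe to their own stream, and request
# a large receive buffer. GROUP and PORT are hypothetical values.
import socket
import struct

GROUP, PORT = "239.192.0.1", 8420  # hypothetical

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
# ask for a larger kernel receive buffer (capped by net.core.rmem_max)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 8 * 1024 * 1024)
sock.bind(("", PORT))
mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

while True:
    data, addr = sock.recvfrom(65536)
    # hand each datagram off to the processing pipeline here
```
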
[22:10:25] PROBLEM - swift-container-server on ms-be1001 is CRITICAL: Connection refused by host
[22:10:43] PROBLEM - swift-container-auditor on ms-be1001 is CRITICAL: Connection refused by host
[22:10:43] PROBLEM - swift-object-auditor on ms-be1001 is CRITICAL: Connection refused by host
[22:10:52] PROBLEM - swift-account-replicator on ms-be1001 is CRITICAL: Connection refused by host
[22:10:52] PROBLEM - swift-account-auditor on ms-be1001 is CRITICAL: Connection refused by host
[22:11:52] ottomata: this is pretty well over my head but have you looked at different kernels--like the low latency or preemptable ones?
[22:13:50] no
[22:13:53] over my head too :)
[22:14:09] but i am verrryyyy suspicious that it is simpler than this
[22:14:45] it's a lot to process
[22:14:51] because, locke is even running 4 unsampled pipe processes right now
[22:14:53] and its doing ok
[22:15:42] we're sending 100% of our hits to locke?
[22:16:53] i guess at CL we were writing files, that added a lot of work for the collector
[22:24:07] speaking of locke, i can't ssh into it directly ...
[22:24:10] checking it out on console
[22:25:54] oh nm, i'm having some weird problem from ssh'ing through
[22:27:49] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:29:09] (Jeff_Green, what's CL?)
[22:29:40] craigslist
[22:30:13] PROBLEM - NTP on ms-be1001 is CRITICAL: NTP CRITICAL: No response from NTP server
[22:30:33] thought so, they use udp2log at craigslist?
[22:32:38] PROBLEM - Swift HTTP on ms-fe1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:32:38] PROBLEM - Swift HTTP on ms-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:36:19] i believe cl used their own thing
[22:36:40] PROBLEM - Host ms-be1001 is DOWN: PING CRITICAL - Packet loss = 100%
[22:36:46] (it's been a while ago for me so my memory is more faint than jeff's, plus i was only working on their tubes)
[22:41:46] RobH: do you think i should move the parsoid request ticket to core-ops or to procurement ?
[22:42:22] the actual server procurement should go in procurement
[22:42:33] for all the installs/deploys i have been making new tickets and linking
[22:43:27] i know the tickets may be a bit of a discussion
[22:43:30] and a bit of hardware procurement
[22:43:35] Can someone please do this on fenari: rm -rf /home/wikipedia/common/php-1.21wmf1
[22:43:40] in which case i may leave in core-ops but then make a procurement ticket and link.
[22:43:53] okay
[22:44:00] Reedy: i will do that as well
[22:44:05] the previous okay was to RobH
[22:44:07] thanks
[22:44:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.042 seconds
[22:44:15] even if its just 'need to get hardware for the linked core-ops discussion'
[22:44:15] screwy permissions on old git objects
[22:44:18] and it gets filled out later
[22:45:03] will make that ticket
[22:45:07] being ticket bitch is tough work !
[22:45:39] Reedy: done
[22:45:48] Cheers
[22:45:53] !log removed /home/wikipedia/common/php-1.21wmf1/ from fenari
[22:46:00] Logged the message, Mistress of the network gear.
[22:46:14] LeslieCarr: well i certainly appreciate it =]
[22:48:33] ottomata: you might know this - does udp2log work in varnish or only squid ?
[22:49:21] varnish, squid and nginx
[22:49:39] sequence numbering doesn't work properly in nginx
[22:49:42] cool, wikitech page out of date :)
[22:49:43] but varnish and squid work well
[22:49:51] now the real question is, how to set it up :)
[22:53:39] set up, varnish udp logging?
[22:54:26] LeslieCarr ^?
[22:55:31] yes, specifically on the blog
[22:55:35] (sorry, going through tickets)
[22:57:21] that's done
[22:57:33] who's that assigned to?
[22:57:47] RT#?
[22:58:53] https://rt.wikimedia.org/Ticket/Display.html?id=4049
[22:59:00] wrong one
[22:59:01] oops
[22:59:06] binasher: Is there any way to get ishmael aggregate slow queries etc across multiple hosts (ie a cluster)?
[22:59:07] here we go - https://rt.wikimedia.org/Ticket/Display.html?id=4027
[23:01:33] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[23:02:11] ottomata: ^^
[23:02:30] oh yeah, totally done
[23:02:32] i will resolve
[23:02:38] ottomata: yay thank you :)
[23:02:53] * LeslieCarr hugs ottomata
[23:03:27] tfinc: so who was working with you on https://rt.wikimedia.org/Ticket/Display.html?id=3738 ?
[23:04:07] LeslieCarr: Max, Asher, Rob, Niklas
[23:04:22] i'm just looking for someone to assign the ticket to
[23:04:23] mwhahaha
[23:04:33] i'm going to go with binasher for this one
[23:04:48] RobH: are you interested in doing https://rt.wikimedia.org/Ticket/Display.html?id=4049 ?
[23:06:23] notpeter: I was in a meeting, but that seems ok
[23:06:39] ok, cool
[23:10:51] New patchset: Pyoungmeister; "coredb: a bit more re-merging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37963
[23:10:56] RobH: I am giving this to you but feel free to be like "woah" or ask for any help
[23:11:16] it should be pretty simple - adding the new package using reprepro and then coordinating a time to do the upgrade
[23:14:18] robh: rt4064 for the idrac licenses
[23:14:22] in eqiad
[23:14:27] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37963
[23:16:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:17:04] !log pgehres synchronized php-1.21wmf5/extensions/ContributionTracking/ 'Updating ContributionTracking to master'
[23:17:11] Logged the message, Master
[23:21:09] New patchset: Cmjohnson; "Updating MAC's addresses to reflect h/w change to 720xd's for ms-be10xx" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37965
[23:21:47] lesliecarr, thanks for your help today, i think we figured out the cause, i gotta run
[23:21:49] talk to you laters
[23:21:54] bye
[23:22:16] can someone +2 my change ^^^
[23:22:45] cmjohnson1: are you not having +2 permissions ?
[23:22:51] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37965
[23:23:04] notpeter: up for a few tickets? :)
[23:23:07] and merged on sockpuppet
[23:23:35] lesliecarr: there is a ticket in to fix that but I think it needs ryan_lane approval
[23:23:37] LeslieCarr: I'm trying to finish up db module for asher. I can if you need me too, but I would really like to focus on this
[23:23:52] LeslieCarr: what needs doing?
[23:23:52] notpeter: thx
[23:23:55] nothing is pressing for today
[23:24:06] some poking at rt that has been around since december
[23:24:27] i mean september
[23:24:31] so definitely unurgent
[23:24:35] heh, ok
[23:24:38] i'm just going through all the ops-requests tickets
[23:24:43] too much RT activity, cant find the stuff from a few hours ago anymore. but cool :)
[23:24:59] gotcha. ping again tomorrow or weds :)
[23:25:32] now that it's reassigned to you i don't care --- it's my dirty not-so-secret
[23:26:31] ok, gotcha
[23:26:32] ok, cool
[23:27:46] New patchset: Pyoungmeister; "flip flopping db61 manifests for another test" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37967
[23:28:17] New patchset: Reedy; "RT #2295: Run cleanupUploadStash across all wikis daily" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37968
[23:28:42] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37967
[23:30:33] Reedy: house => "2",
[23:32:36] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.307 seconds
[23:33:16] AaronSchulz: Yes, I want them to have 2 houses
[23:34:40] New patchset: Reedy; "RT #2295: Run cleanupUploadStash across all wikis daily" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37968
[23:35:39] Reedy, failagain
[23:36:34] Not really
[23:36:38] That fail was there in patchset 1
[23:36:47] New patchset: Reedy; "RT #2295: Run cleanupUploadStash across all wikis daily" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37968
[23:37:30] I give up
[23:37:44] New patchset: Reedy; "RT #2295: Run cleanupUploadStash across all wikis daily" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37968
[23:46:33] PROBLEM - Puppet freshness on ms-be3 is CRITICAL: Puppet has not run in the last 10 hours
[23:51:03] https://gerrit.wikimedia.org/r/37973
[23:51:32] * Reedy kicks fenari
[23:59:54] New patchset: Pyoungmeister; "Revert "flip flopping db61 manifests for another test"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/37980