[00:00:19] !log mflaschen synchronized php-1.21wmf12/extensions/GuidedTour/GuidedTour.php 'Small bug fix to GuidedTour; removing unneeded dependency. https://gerrit.wikimedia.org/r/#/c/56546/1/GuidedTour.php'
[00:00:25] Logged the message, Master
[00:01:30] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:02:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[00:03:12] !log mflaschen synchronized php-1.21wmf12/extensions/GuidedTour/GuidedTour.php 'Small bug fix to GuidedTour; removing unneeded dependency. https://gerrit.wikimedia.org/r/#/c/56546/1/GuidedTour.php'
[00:03:19] Logged the message, Master
[00:06:29] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[00:07:49] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 00:07:43 UTC 2013
[00:08:09] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 30698 MB (3% inode=99%):
[00:08:29] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[00:08:39] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[00:12:12] !log olivneh synchronized php-1.21wmf12/extensions/MoodBar 'Updating MoodBar to remove ClickTracking integration'
[00:12:12] Logged the message, Master
[00:14:39] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 00:14:33 UTC 2013
[00:15:29] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[00:26:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:27:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.194 second response time
[00:28:28] New patchset: Reedy; "Update php symlink to 1.21wmf12" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56556
[00:28:41] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56556
[01:07:48] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[01:09:28] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 30251 MB (3% inode=99%):
[01:09:58] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[01:16:56] ori-l: hey
[01:17:02] paravoid: hey
[01:17:14] I completely missed the patchset
[01:17:22] sorry
[01:17:25] no worries at all
[01:17:26] ok to merge now?
[01:17:35] only if you want to make me super happy
[01:17:41] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54324
[01:17:48] :)))
[01:18:06] done
[01:18:06] ^ ori-l eating a burger emoticon
[01:18:08] if i sudo puppetd -tv on vanadium, will it pull the changes, or do i need to wait 30 minutes?
[01:18:19] YuviPanda: heh
[01:18:22] it will
[01:18:27] I was about to do that, but better if you do :)
[01:18:34] ok, will do
[01:18:51] you're more likely to notice something going wrong in the diffs
[01:19:10] running..
[01:24:02] !log Stopping EventLogging daemons to allow Puppet to change 'eventlogging' user's home dir
[01:24:09] Logged the message, Master
[01:24:27] omg events dropping!
[01:24:36] pid 21038, uptime 34 days, 15:18:55
[01:24:37] :(
[01:24:41] it was a good run
[01:26:29] hrm: err: /Stage[main]/Eventlogging::Archive/File[/etc/logrotate.d/eventlogging]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///files/eventlogging/logrotate at /var/lib/git/operations/puppet/modules/eventlogging/manifests/archive.pp:25
[01:27:26] whoops
[01:27:30] but modules/eventlogging/files/logrotate is there
[01:27:43] no, that's not how you refer to files in modules
[01:27:48] my bad, should have spotted that
[01:28:01] yes, damn you for not fixing my bugs
[01:28:15] what should it be?
[01:30:31] c'mon gerrit
[01:30:37] New patchset: Faidon; "eventlogging: fix file path" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56561
[01:31:20] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56561
[01:31:28] ok, try again
[01:31:36] btw, you had it right on ganglia.pp :)
[01:31:41] oh, right
[01:31:45] yeah, i just saw your change
[01:33:10] New patchset: Dr0ptp4kt; "Unified default lang redirect from m. & zero." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55302
[01:34:14] dr0ptp4kt: and who are you? :)
[01:35:25] hi there - adam baso here - sitting next to yurik this week while he's on site.
[01:35:48] ah
[01:36:00] I'm Faidon, as /whois says :)
[01:36:50] I should probably update, LOL, will do.
[01:37:42] paravoid: http://dpaste.org/9T1Gn/raw/ worked
[01:37:44] New patchset: Dzahn; "add a script and cron to mail out bugzilla audit log and move bugzilla scripts to files/bugzilla instead of misc" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56562
[01:38:49] great
[01:38:57] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours
[01:38:57] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours
[01:38:57] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours
[01:39:10] New patchset: Dzahn; "add a script and cron to mail out bugzilla audit log and move bugzilla scripts to files/bugzilla instead of misc" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56562
[01:42:09] paravoid: the zpubmon ganglia module doesn't appear to be reporting metrics http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&title=EventLogging&vl=events+%2F+sec&x=&n=&hreg[]=vanadium.eqiad.wmnet&mreg[]=%5E%28client-generated-raw%7Cserver-generated-raw%7Cvalid-events%29%24&gtype=stack&glegend=show&aggregate=1
[01:42:17] but everything seems to be running, so i'll debug from home
[01:42:21] but if you have any ideas, let me know
[01:42:28] thanks very much again
[01:45:40] actually, paravoid, got a second?
[01:46:07] yes
[01:46:17] if so, have a look at file { '/usr/lib/ganglia/python_modules/zpubmon.py' in modules/eventlogging/manifests/ganglia.pp
[01:46:28] i tried to make it a symlink to /srv/deployment/eventlogging/EventLogging/ganglia/python_modules/zpubmon.py
[01:46:35] the destination is there, but instead of a symlink i got a directory
[01:47:02] /usr/lib/ganglia/python_modules/zpubmon.py: directory
[01:47:16] recurse => true
[01:47:18] why?
[01:47:28] why why?
[01:47:35] sorry, bad recursion joke
[01:47:39] haha
[01:48:02] oh, i was probably thinking that it would create parent dirs as necessary
[01:48:10] but instead it's causing puppet to interpret the resource as a directory
[01:48:33] yeah
[01:48:51] recurse means "recursively copy directory"
[01:49:58] New patchset: Ori.livneh; "Drop "recurse => true" from EventLogging Ganglia resources" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56563
[01:50:25] puppet has more booby traps than an indiana jones movie
[01:51:02] paravoid: ^^ patch
[01:51:09] Change merged: Faidon; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56563
[01:53:51] paravoid: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&title=EventLogging&vl=events+%2F+sec&x=&n=&hreg[]=vanadium.eqiad.wmnet&mreg[]=%5E%28client-generated-raw%7Cserver-generated-raw%7Cvalid-events%29%24&gtype=stack&glegend=show&aggregate=1
[01:53:57] THANK YOU, weee
[01:54:00] what a relief
[01:54:17] i felt like a thief in the night having this stuff running but unpuppetized
[01:55:47] hahaha
[01:56:37] :) thanks again and see you later
[01:56:38] * ori-l runs home
[01:57:46] bye!
[02:06:09] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[02:08:19] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[02:08:49] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 29655 MB (3% inode=99%):
[02:10:49] PROBLEM - Puppet freshness on cp3010 is CRITICAL: Puppet has not run in the last 10 hours
[02:17:06] !log LocalisationUpdate completed (1.21wmf12) at Fri Mar 29 02:17:06 UTC 2013
[02:17:14] Logged the message, Master
[02:22:49] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours
[02:26:41] New patchset: Odder; "(bug 46154) Override $wgGroupPermissions for thwiki Add abusefilter-log-detail and patrol for autoconfirmed on thwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56564
[03:04:41] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[03:06:51] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[03:07:21] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 29194 MB (3% inode=99%):
[03:10:41] PROBLEM - Apache HTTP on mw1171 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:10:41] PROBLEM - Apache HTTP on mw1060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:10:41] PROBLEM - Apache HTTP on mw1054 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:11:11] PROBLEM - Apache HTTP on mw1172 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:11:31] RECOVERY - Apache HTTP on mw1171 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.336 second response time
[03:11:32] RECOVERY - Apache HTTP on mw1054 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.061 second response time
[03:11:32] RECOVERY - Apache HTTP on mw1060 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.149 second response time
[03:12:01] RECOVERY - Apache HTTP on mw1172 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.057 second response time
[03:18:12] Waiting for 10.64.16.145: 140 seconds lagged
[03:19:08] db1050
[03:20:30] binasher: ^
[03:20:43] lots of Waiting for the slave SQL thread to advance position
[03:21:57] a lot of wait cpu
[03:22:08] db1051 and db1052 are more loaded
[03:23:47] Reedy: gah, looking
[03:24:20] thanks, it's in the 200s now
[04:01:33] 1 million jobs queued on enwiki in the last hour, 150k in one min in large write queries…
[04:01:54] this caused a partial site outage when it happened earlier today, and it's likely to again
[04:02:09] hmmm?
[04:03:12] let's reparse 1/3rd of enwiki, shall we?
[04:03:12] what inserts them?
[04:03:20] template edits?
[04:03:42] these are refreshLinks jobs
[04:04:16] Reedy: around?
[04:04:17] do you know the cause?
[04:04:25] or should I start investigating too?
[04:06:03] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[04:06:36] paravoid: i don't
[04:07:02] Aaron|home: any changes in wmf12 that would be relevant?
[04:07:28] no, just usage changes probably
[04:07:43] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 28590 MB (3% inode=99%):
[04:07:53] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 04:07:51 UTC 2013
[04:08:03] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[04:08:13] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[04:08:33] it looks like wikidata utilizes refreshlinks jobs
[04:08:59] hrm, link?
[04:09:03] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 04:08:57 UTC 2013
[04:10:03] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[04:10:03] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 04:09:57 UTC 2013
[04:11:03] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[04:11:43] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 04:11:39 UTC 2013
[04:12:03] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[04:12:23] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 04:12:18 UTC 2013
[04:13:03] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[04:14:43] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 04:14:34 UTC 2013
[04:15:03] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[04:15:37] !log aaron synchronized php-1.21wmf12/includes/job/JobQueueGroup.php 'deployed 651884d4bdc76c29dd717dfdbbd698632223f3b5 '
[04:15:45] Logged the message, Master
[04:16:18] !log aaron synchronized php-1.21wmf12/maintenance/runJobs.php 'deployed 651884d4bdc76c29dd717dfdbbd698632223f3b5'
[04:16:25] Logged the message, Master
[04:18:07] Aaron|home: the wikidata updater is actually just inserting ChangeNotification jobs
[04:19:07] WikiPageUpdater has scheduleRefreshLinks
[04:19:13] which might resolve to refreshlinks
[04:19:13] PROBLEM - Puppet freshness on mw1160 is CRITICAL: Puppet has not run in the last 10 hours
[04:19:56] well that's directly what it inserts
[04:20:01] 02:49:58 Posted 1000 changes to enwiki, up to ID 14970265, timestamp 20130328102638. Lag is 59000 seconds. Next ID is 14970265.
[04:20:02] 02:58:28 Posted 1000 changes to enwiki, up to ID 14971265, timestamp 20130328103006. Lag is 59302 seconds. Next ID is 14971265.
[04:20:02] 02:58:31 Posted 1000 changes to enwiki, up to ID 14972265, timestamp 20130328103325. Lag is 59106 seconds. Next ID is 14972265.
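
For readers following the zpubmon exchange above ([01:42]-[01:53]): a gmond Python metric module is just a file in /usr/lib/ganglia/python_modules that exposes metric_init() and metric_cleanup(). The sketch below is a minimal, hypothetical skeleton in the same shape, not the real zpubmon.py; the metric name mirrors one from the Ganglia URL ori-l pasted, but the value-producing callback here is a placeholder rather than EventLogging's actual counter logic.

    # zpubmon_sketch.py -- hypothetical skeleton, NOT the real zpubmon module
    import random

    def events_per_sec(name):
        # The real module would read counters from the EventLogging stream;
        # this placeholder just returns a random value for illustration.
        return random.randint(0, 100)

    def metric_init(params):
        # gmond calls this once at startup; return a list of metric descriptors.
        return [{
            'name': 'valid-events',
            'call_back': events_per_sec,
            'time_max': 90,
            'value_type': 'uint',
            'units': 'events/sec',
            'slope': 'both',
            'format': '%u',
            'description': 'EventLogging valid events per second',
            'groups': 'eventlogging',
        }]

    def metric_cleanup():
        pass  # gmond calls this on shutdown

    if __name__ == '__main__':
        # conventional standalone smoke test for gmond python modules
        for d in metric_init({}):
            print('%s = %s' % (d['name'], d['call_back'](d['name'])))
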
[04:20:30] refreshLinks per-title jobs for several titles
[04:20:37] I wonder what the batch limit is
[04:20:46] those are each individual change notif jobs that i suppose could turn into thousands of refreshlinks jobs
[04:21:05] 1000 batch limit for change notifications
[04:21:18] so it seems
[04:21:21] what log are you looking at?
[04:21:59] hume:/var/log/wikidata/dispatcher.log /var/log/wikidata/dispatcher2.log
[04:22:13] ah, hume, right
[04:22:39] Tim-away asked them not to reparse everything every time a langlink is changed
[04:22:45] I'm hoping that change I deployed will reduce the instances of RL jobs being spawned before the existing ones finish
[04:23:26] I was aiming to do that before but the code didn't handle the case where runJobs already started, which should be handled now
[04:23:40] will that slow the actual insertion of refreshlinks jobs?
[04:23:42] * Aaron|home was wondering why sometimes the green/blue lines on graphite wildly didn't match
[04:24:10] it will slow it down to the rate of "how fast it can parse and finish all the existing jobs" before converting more RL2 jobs to RL jobs
[04:24:23] that and the profiling collector was fairly broken for a while
[04:24:30] I originally wanted that just to reduce the width of the queue, but it also slows it down
[04:24:52] I still have to get someone to review https://gerrit.wikimedia.org/r/#/c/56522/
[04:25:25] are the ChangeNotification jobs also low priority?
[04:26:47] yes
[04:29:30] re: the comment in that review to use casting for $count, isn't it already cast as an int from incr()?
[04:30:16] he meant use (int) instead of intval()
[04:30:33] is either necessary though?
[04:30:37] otherwise the world might end
[04:30:56] you know I just moved that code from elsewhere
[04:31:10] I didn't want to change it around for fun reasons in that same commit
[04:32:13] you touched it though! heh
[04:32:24] i would delete that line for fun
[04:33:01] this looks reasonable to me though
[05:04:21] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[05:06:31] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[05:07:01] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 29060 MB (3% inode=99%):
[05:13:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:14:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[05:28:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:29:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[05:33:40] New patchset: Rfaulk; "mod. use get_project_host_map method to generate map for project to host key." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56576
[05:46:08] New patchset: J; "install lilypond on apache nodes (used by Score extension)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56577
[05:51:34] PROBLEM - SSH on lvs1001 is CRITICAL: Connection timed out
[05:51:35] PROBLEM - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[05:51:36] PROBLEM - LVS HTTPS IPv4 on foundation-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out
[05:51:36] PROBLEM - LVS HTTPS IPv4 on wikibooks-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out
[05:51:37] PROBLEM - LVS HTTP IPv4 on mediawiki-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out
[05:51:37] PROBLEM - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: Connection timed out
[05:51:38] PROBLEM - LVS HTTP IPv4 on wikibooks-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out
[05:51:38] PROBLEM - LVS HTTPS IPv4 on mobile-lb.eqiad.wikimedia.org is CRITICAL: Connection timed out
[05:52:22] uh
[05:52:34] RECOVERY - SSH on lvs1001 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[05:52:34] RECOVERY - LVS HTTPS IPv4 on wikibooks-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 62746 bytes in 0.038 second response time
[05:52:34] RECOVERY - LVS HTTP IPv4 on mediawiki-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 62740 bytes in 0.004 second response time
[05:52:34] RECOVERY - LVS HTTP IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 16451 bytes in 0.010 second response time
[05:52:36] RECOVERY - LVS HTTP IPv4 on wikibooks-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.0 200 OK - 62740 bytes in 0.014 second response time
[05:52:36] RECOVERY - LVS HTTPS IPv4 on foundation-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 62746 bytes in 0.047 second response time
[05:52:36] RECOVERY - LVS HTTPS IPv6 on mobile-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 16485 bytes in 0.024 second response time
[05:52:38] RECOVERY - LVS HTTPS IPv4 on mobile-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 16492 bytes in 0.019 second response time
[05:53:11] ok, why'd you kill mobile asher ?
[05:53:13] and lvs1001
[05:55:46] so many reasons but here goes.. 1 - heeeyyy
[05:58:06] hehe
[05:58:11] okay, i sent off a quick email
[05:59:17] bye
[05:59:20] ah, I just saw it (I did not even
[05:59:26] see these til now)
[05:59:36] still waking up..
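
On the refreshLinks fan-out discussed above ([04:20]-[04:33]): one wikidata change notification can expand into thousands of per-title refreshLinks jobs, and the fix Aaron describes is to stop converting more parent jobs while children from a previous expansion are still draining. As a language-neutral illustration only (MediaWiki's real implementation is PHP, in JobQueueGroup/runJobs, and differs in detail), here is a toy Python sketch of that idea; all names are invented:

    from collections import deque

    class ThrottledExpander(object):
        """Toy model: expand 'parent' jobs (e.g. ChangeNotification)
        into many 'child' jobs (e.g. refreshLinks), but never fan out
        again while children from the previous expansion are queued."""
        def __init__(self, expand_fn):
            self.parents = deque()
            self.children = deque()
            self.expand_fn = expand_fn  # parent -> list of children

        def push_parent(self, job):
            self.parents.append(job)

        def pop_child(self):
            if not self.children and self.parents:
                # Only expand the next parent once the batch drained.
                self.children.extend(self.expand_fn(self.parents.popleft()))
            return self.children.popleft() if self.children else None

Draining each expansion fully before the next parent fans out caps the queue's width the way Aaron describes ("reduce the width of the queue"), at the cost of expanding more slowly.
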
[06:04:53] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[06:06:03] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[06:06:33] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 29582 MB (3% inode=99%):
[06:30:13] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 06:30:03 UTC 2013
[06:30:53] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[06:30:53] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 06:30:46 UTC 2013
[06:31:53] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[06:31:53] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 06:31:52 UTC 2013
[06:32:53] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[06:59:47] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours
[07:05:25] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[07:07:05] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 29131 MB (3% inode=99%):
[07:07:35] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[07:15:02] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 07:14:46 UTC 2013
[07:15:25] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[07:21:35] PROBLEM - Apache HTTP on mw1111 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:21:36] PROBLEM - Apache HTTP on mw1112 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:21:36] PROBLEM - Apache HTTP on mw1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:21:36] PROBLEM - Apache HTTP on mw1052 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:21:36] PROBLEM - Apache HTTP on mw1104 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:21:36] PROBLEM - Apache HTTP on mw1167 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:21:55] PROBLEM - Apache HTTP on mw1082 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:06] PROBLEM - Apache HTTP on mw1019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:06] PROBLEM - Apache HTTP on mw1059 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:15] PROBLEM - Apache HTTP on mw1174 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:15] PROBLEM - Apache HTTP on mw1073 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:15] PROBLEM - Apache HTTP on mw1113 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:15] PROBLEM - Apache HTTP on mw1078 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:15] PROBLEM - Apache HTTP on mw1100 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:16] PROBLEM - Apache HTTP on mw1098 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:16] PROBLEM - Apache HTTP on mw1105 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:17] PROBLEM - Apache HTTP on mw1074 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:17] PROBLEM - Apache HTTP on mw1108 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:18] site is rather slow right now...
[07:22:19] PROBLEM - Apache HTTP on mw1083 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:19] PROBLEM - Apache HTTP on mw1212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:19] PROBLEM - Apache HTTP on mw1040 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:19] PROBLEM - Apache HTTP on mw1166 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:20] PROBLEM - Apache HTTP on mw1028 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:21] PROBLEM - Apache HTTP on mw1067 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:21] PROBLEM - Apache HTTP on mw1081 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:26] PROBLEM - Apache HTTP on mw1086 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:26] PROBLEM - Apache HTTP on mw1188 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:26] PROBLEM - Apache HTTP on mw1187 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:26] PROBLEM - Apache HTTP on mw1214 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:26] PROBLEM - Apache HTTP on mw1171 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:27] PROBLEM - Apache HTTP on mw1068 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:27] PROBLEM - Apache HTTP on mw1060 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:27] PROBLEM - Apache HTTP on mw1106 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:27] PROBLEM - Apache HTTP on mw1101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:28] PROBLEM - Apache HTTP on mw1075 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:29] PROBLEM - Apache HTTP on mw1026 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:32] PROBLEM - Apache HTTP on mw1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:32] PROBLEM - Apache HTTP on mw1090 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:33] PROBLEM - Apache HTTP on mw1050 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:34] PROBLEM - Apache HTTP on mw1216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:34] PROBLEM - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:37] PROBLEM - Apache HTTP on mw1018 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:37] PROBLEM - Apache HTTP on mw1079 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:37] PROBLEM - Apache HTTP on mw1055 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:37] PROBLEM - Apache HTTP on mw1164 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:37] PROBLEM - Apache HTTP on mw1064 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:37] PROBLEM - Apache HTTP on mw1186 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:37] PROBLEM - Apache HTTP on mw1109 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:37] PROBLEM - Apache HTTP on mw1097 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:37] PROBLEM - Apache HTTP on mw1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:37] PROBLEM - Apache HTTP on mw1168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:39] PROBLEM - Apache HTTP on mw1096 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:39] PROBLEM - Apache HTTP on mw1103 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:40] PROBLEM - Apache HTTP on mw1080 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:40] PROBLEM - LVS HTTPS IPv4 on wikivoyage-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:40] RECOVERY - Apache HTTP on mw1104 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 8.617 second response time
[07:22:40] RECOVERY - Apache HTTP on mw1167 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 9.011 second response time
[07:22:42] PROBLEM - Apache HTTP on mw1057 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:42] PROBLEM - Apache HTTP on mw1185 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:44] PROBLEM - Apache HTTP on mw1025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:44] PROBLEM - Apache HTTP on mw1032 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:44] PROBLEM - Apache HTTP on mw1056 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:44] PROBLEM - Apache HTTP on mw1099 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:44] PROBLEM - Apache HTTP on mw1039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:45] RECOVERY - Apache HTTP on mw1082 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.052 second response time
[07:22:45] PROBLEM - Apache HTTP on mw1102 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:55] RECOVERY - Apache HTTP on mw1059 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.125 second response time
[07:23:05] RECOVERY - Apache HTTP on mw1113 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.049 second response time
[07:23:05] RECOVERY - Apache HTTP on mw1174 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.051 second response time
[07:23:05] RECOVERY - Apache HTTP on mw1078 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.056 second response time
[07:23:05] RECOVERY - Apache HTTP on mw1073 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.087 second response time
[07:23:05] RECOVERY - Apache HTTP on mw1098 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.393 second response time
[07:23:06] RECOVERY - Apache HTTP on mw1074 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.056 second response time
[07:23:06] RECOVERY - Apache HTTP on mw1105 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.061 second response time
[07:23:07] RECOVERY - Apache HTTP on mw1212 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.059 second response time
[07:23:07] RECOVERY - Apache HTTP on mw1108 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.067 second response time
[07:23:08] RECOVERY - Apache HTTP on mw1083 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.070 second response time
[07:23:08] RECOVERY - Apache HTTP on mw1040 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.085 second response time
[07:23:09] RECOVERY - Apache HTTP on mw1100 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 2.642 second response time
[07:23:09] RECOVERY - Apache HTTP on mw1166 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.055 second response time
[07:23:11] RECOVERY - Apache HTTP on mw1067 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.057 second response time
[07:23:11] RECOVERY - Apache HTTP on mw1081 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time
[07:23:11] RECOVERY - Apache HTTP on mw1028 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.076 second response time
[07:23:15] RECOVERY - Apache HTTP on mw1086 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.048 second response time
[07:23:15] RECOVERY - Apache HTTP on mw1188 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.053 second response time
[07:23:15] RECOVERY - Apache HTTP on mw1187 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.093 second response time
[07:23:15] RECOVERY - Apache HTTP on mw1214 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.058 second response time
[07:23:15] RECOVERY - Apache HTTP on mw1068 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.058 second response time
[07:23:16] RECOVERY - Apache HTTP on mw1060 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.093 second response time
[07:23:17] RECOVERY - Apache HTTP on mw1106 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.099 second response time
[07:23:23] RECOVERY - Apache HTTP on mw1171 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.104 second response time
[07:23:23] RECOVERY - Apache HTTP on mw1101 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 1.476 second response time
[07:23:23] RECOVERY - Apache HTTP on mw1026 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.106 second response time
[07:23:23] RECOVERY - Apache HTTP on mw1075 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 2.281 second response time
[07:23:23] RECOVERY - Apache HTTP on mw1090 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.065 second response time
[07:23:23] RECOVERY - Apache HTTP on mw1023 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.092 second response time
[07:23:23] RECOVERY - Apache HTTP on mw1050 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.071 second response time
[07:23:23] RECOVERY - Apache HTTP on mw1216 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.327 second response time
[07:23:23] RECOVERY - LVS HTTP IPv4 on appservers.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 62900 bytes in 0.330 second response time
[07:23:25] RECOVERY - Apache HTTP on mw1186 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.053 second response time
[07:23:25] RECOVERY - Apache HTTP on mw1064 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.062 second response time
[07:23:25] RECOVERY - Apache HTTP on mw1018 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.075 second response time
[07:23:25] RECOVERY - Apache HTTP on mw1164 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.077 second response time
[07:23:25] RECOVERY - Apache HTTP on mw1097 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.078 second response time
[07:23:27] RECOVERY - Apache HTTP on mw1055 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.067 second response time
[07:23:27] RECOVERY - Apache HTTP on mw1079 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.095 second response time
[07:23:27] RECOVERY - Apache HTTP on mw1109 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.110 second response time
[07:23:27] RECOVERY - Apache HTTP on mw1168 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.059 second response time
[07:23:28] RECOVERY - Apache HTTP on mw1027 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.056 second response time
[07:23:29] RECOVERY - Apache HTTP on mw1096 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.069 second response time
[07:23:29] RECOVERY - Apache HTTP on mw1103 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.064 second response time
[07:23:29] RECOVERY - Apache HTTP on mw1112 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.066 second response time
[07:23:30] RECOVERY - Apache HTTP on mw1020 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.063 second response time
[07:23:30] RECOVERY - Apache HTTP on mw1080 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.085 second response time
[07:23:32] RECOVERY - LVS HTTPS IPv4 on wikivoyage-lb.pmtpa.wikimedia.org is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 757 bytes in 0.202 second response time
[07:23:32] RECOVERY - Apache HTTP on mw1032 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.049 second response time
[07:23:33] RECOVERY - Apache HTTP on mw1025 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.054 second response time
[07:23:33] RECOVERY - Apache HTTP on mw1185 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.055 second response time
[07:23:33] RECOVERY - Apache HTTP on mw1056 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.051 second response time
[07:23:33] RECOVERY - Apache HTTP on mw1039 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.055 second response time
[07:23:35] RECOVERY - Apache HTTP on mw1052 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.073 second response time
[07:23:35] RECOVERY - Apache HTTP on mw1099 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.071 second response time
[07:23:35] RECOVERY - Apache HTTP on mw1057 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.088 second response time
[07:23:35] RECOVERY - Apache HTTP on mw1111 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 3.792 second response time
[07:23:36] RECOVERY - Apache HTTP on mw1102 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.070 second response time
[07:23:55] RECOVERY - Apache HTTP on mw1019 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 747 bytes in 0.050 second response time
[08:07:54] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[08:09:34] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 28581 MB (3% inode=99%):
[08:10:05] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[08:14:34] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 08:14:31 UTC 2013
[08:15:04] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[08:20:35] New patchset: Odder; "(bug 45643) Add new user groups to urwiki with specific rights Add abusefilter and rollbacker user groups, modify $wgAddGroups for crats and sysops, modify $wgRemoveGroups for crats" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56578
[08:26:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:27:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time
[09:01:18] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:02:08] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.122 second response time
[09:05:32] New patchset: Dereckson; "(bug 46686) Throttle rule for gu. workshop" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56582
[09:06:15] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[09:07:25] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[09:07:55] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 29012 MB (3% inode=99%):
[09:20:14] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56582
[09:25:30] !log olivneh synchronized wmf-config/throttle.php 'Updating throttle rules for guwiki workshop (Bug 46686)'
[09:25:36] Logged the message, Master
[09:40:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:41:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.134 second response time
[10:06:07] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[10:07:17] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[10:07:47] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 28434 MB (3% inode=99%):
[10:18:11] New review: PleaseStand; "(5 comments)" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/56408
[10:22:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:23:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.142 second response time
[10:23:31] hello
[11:05:44] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[11:07:24] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 27861 MB (3% inode=99%):
[11:07:54] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[11:21:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:22:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.128 second response time
[11:39:10] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours
[11:39:10] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours
[11:39:10] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours
[11:52:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:52:37] New patchset: Hashar; "beta: restore commonswiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56593
[11:53:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time
[11:53:52] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56593
[11:57:28] New patchset: Hashar; "beta: commonswiki was missing the MediaWiki version" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56594
[11:58:06] Change merged: jenkins-bot; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56594
[11:58:15] New review: Hashar; "Typo in wikiversions is fixed by https://gerrit.wikimedia.org/r/#/c/56594/" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56593
[12:04:03] New review: Aklapper; "Not sure if bugzillaadmin@ an existing alias and if really every Bugzilla admin wants to get spammed..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56562
[12:06:22] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[12:08:02] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 12:07:54 UTC 2013
[12:08:22] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[12:08:32] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[12:09:02] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 12:08:52 UTC 2013
[12:09:02] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 27392 MB (3% inode=99%):
[12:09:22] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[12:09:52] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 12:09:45 UTC 2013
[12:10:22] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[12:10:39] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 12:10:31 UTC 2013
[12:11:02] PROBLEM - Puppet freshness on cp3010 is CRITICAL: Puppet has not run in the last 10 hours
[12:11:22] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[12:11:52] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 12:11:43 UTC 2013
[12:12:22] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[12:14:42] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 12:14:34 UTC 2013
[12:15:22] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[12:21:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:22:13] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time
[12:23:02] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours
[13:05:27] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[13:07:36] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[13:07:41] poor db11
[13:08:06] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 26936 MB (3% inode=99%):
[13:09:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:10:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time
[13:14:12] New patchset: Hashar; "0.6.1-2 gbp.conf and tweaks" [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/56168
[13:14:34] New review: Hashar; "I have moved the PHONY statement at the end of debian/rules" [operations/debs/python-voluptuous] (master) - https://gerrit.wikimedia.org/r/56168
[13:26:02] New review: Hashar; "(4 comments)" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55069
[13:26:22] New patchset: Hashar; "Inital deb packaging" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55069
[13:27:27] New review: Hashar; "PS2 fix all the minor issues reported by Faidon. That brings this package up to par with the python-..." [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55069
[13:51:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:52:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.140 second response time
[14:06:06] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[14:08:16] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[14:08:46] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 27405 MB (3% inode=99%):
[14:19:56] PROBLEM - Puppet freshness on mw1160 is CRITICAL: Puppet has not run in the last 10 hours
[14:22:16] hmm, can I get second eyes on this change from Ryan F?
[14:22:16] https://gerrit.wikimedia.org/r/#/c/56576/1/templates/misc/e3-metrics.settings.py.erb
[14:22:26] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:22:43] it downloads config from operations/mediawiki-config.git gerrit gitweb
[14:23:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time
[14:23:27] is that ok? should I tell him to clone that repo and just read from a file? is that even any better?
[14:28:07] New patchset: Hashar; "Inital deb packaging" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55069
[14:33:57] New review: Hashar; "PS3:" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55069
[14:34:49] New patchset: Hashar; "Inital deb packaging" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55069
[14:35:07] New review: Hashar; "PS4: removes override_dh_auto_test , there are no doc tests :-]" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55069
[14:49:51] New review: Hashar; "And the package no more works :(" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55069
[14:58:03] ottomata: is it a cron?
[14:58:17] no
[14:58:21] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:58:29] its a python web app
[14:58:32] runs in apache as the stats user
[14:58:52] so what triggers that function?
[14:59:02] app startup
[14:59:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.142 second response time
[14:59:18] or maybe even first page request, i'm not sure. i'm pretty sure its app startup
[14:59:39] the app caches the configs in a persistent file somehow
[14:59:49] so most of the time it won't make the request
[14:59:51] but still
[15:02:11] hrmmm, so 2 thoughts
[15:02:22] 1) if you're going to use HTTP then just use noc.wm.o/conf
[15:02:54] (and then you can even use HTTP headers like if-modified-since)
[15:03:00] yeah that's nice, i'll tell him that
[15:03:09] cool! i doubt either of us know that exists
[15:03:25] oh, do you want to just comment on the change? (I can do it if you don't want to)
[15:03:35] https://gerrit.wikimedia.org/r/#/c/56576/
[15:03:43] 2) not sure about using a git clone or not
[15:05:25] New patchset: Hashar; "Reimported Upstream version 1.5.8" [operations/debs/python-statsd] (upstream) - https://gerrit.wikimedia.org/r/56600
[15:05:51] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[15:06:00] ARGHHHH
[15:06:12] erm? :)
[15:06:46] I pushed to gerrit some changes
[15:06:49] but later on did a git push
[15:06:57] and abandoned the changes
[15:07:03] thus gerrit considers them unmerged
[15:07:04] :(
[15:07:07] but they are in the repo
[15:07:29] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 27955 MB (3% inode=99%):
[15:07:44] ^demon: ping pong :-] Are you able to alter the Gerrit database to flag a change as being merged?
[15:07:49] ottomata: i'm quoting you
[15:07:56] <^demon> hashar: Yeah, change #?
[15:07:57] ^demon: they are registered changes but got them pushed directly in the repo.
[15:07:59] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[15:08:03] New review: Ottomata; "Hiiii Ryan!" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56576
[15:08:13] New review: Jeremyb; "(1 comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/56576
[15:08:21] ^demon: https://gerrit.wikimedia.org/r/#/c/55038/ and https://gerrit.wikimedia.org/r/#/c/55039
[15:08:36] erm
[15:08:45] ^demon: I checked their latest patchset sha1. They are present in the repo already (branch `upstream`)
[15:08:46] * jeremyb_ meets ottomata in midair
[15:09:07] hehe
[15:09:33] ottomata: btw, idk why people do that, in nick form i'm all lowercase :-)
[15:09:43] haha
[15:09:45] ok good to know
[15:09:48] you like the _ too?
[15:09:48] * jeremyb_ glares @ Amgine
[15:09:51] no
[15:09:59] no _. that's only for Logan
[15:12:49] <^demon> hashar: Update status on both.
[15:12:52] <^demon> *updated
[15:14:44] ^demon: my hero
[15:14:48] <^demon> yw
[15:15:40] now my change is mergeable yeahhh
[15:16:12] New patchset: Lcarr; "fixing script to use /var/run/icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56601
[15:16:23] ottomata: would you mind merging a change for me ? My python-statsd package has a wrong `upstream` branch. When running git-import origin I used an incorrect tar ball. Change is https://gerrit.wikimedia.org/r/#/c/56600/ :-]
[15:16:49] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56601
[15:17:12] Change merged: Ottomata; [operations/debs/python-statsd] (upstream) - https://gerrit.wikimedia.org/r/56600
[15:17:17] hashar, done
[15:17:22] danke
[15:23:13] New patchset: Hashar; "Merge `upstream` Reimported Upstream version 1.5.8" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/56602
[15:24:27] New patchset: Hashar; "Inital deb packaging" [operations/debs/python-statsd] (master) - https://gerrit.wikimedia.org/r/55069
[15:27:09] New review: Hashar; "Turns out the upstream branch was having an incorrect package." [operations/debs/python-statsd] (master); V: 1 C: 1; - https://gerrit.wikimedia.org/r/55069
[15:27:09] ah, i just figured out why my ganglia thing wasn't working
[15:27:23] it needs to read a proc file and doesn't have permissions
[15:27:50] that is unfortunate
[15:27:58] mmhmmm
[15:28:13] not sure if I can fix that one, i can't just change file perms on /proc
[15:28:14] ori-l
[15:28:18] any thoughts on that?
[15:28:21] this is for the udp2log socket stats
[15:28:28] what is the /proc file ?
[15:28:34] /proc/26970/fd
[15:28:40] udp2log proc file
[15:28:44] ah
[15:28:51] maybe add ganglia to the udp2log group?
[15:28:55] maybe not ideal though
[15:29:27] naw, the proc file is owned by root 500
[15:30:42] hm I think I can do this differently, just much less elegantly
[15:34:11] ottomata: is what you want not available from `netstat` ?
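
On jeremyb_'s noc.wm.o/conf suggestion at [15:02]: the point of If-Modified-Since is that the app keeps its cached copy and pays for a full download only when the file actually changed. A minimal sketch, assuming the `requests` library is available and using a hypothetical conf file name (the exact path under noc's conf listing is an assumption here):

    import requests  # assumed available; any HTTP client works

    # Hypothetical example path under noc's conf listing.
    CONF_URL = 'http://noc.wikimedia.org/conf/InitialiseSettings.php.txt'

    def fetch_conf(cached_body=None, cached_lastmod=None):
        headers = {}
        if cached_lastmod:
            # Ask the server to skip the body if nothing changed.
            headers['If-Modified-Since'] = cached_lastmod
        resp = requests.get(CONF_URL, headers=headers, timeout=10)
        if resp.status_code == 304:
            # Unchanged since our copy: reuse the cache, no body was sent.
            return cached_body, cached_lastmod
        resp.raise_for_status()
        return resp.text, resp.headers.get('Last-Modified')
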
[15:35:19] sorta, i mean, not on a per process level (is it?)
[15:35:48] this is code ori-l wrote to get udp socket stats from udp2log processes
[15:35:57] i can get the info out of /proc/net/udp
[15:36:31] but
[15:36:41] ori-l is using /proc to find the socket inodes
[15:36:50] to figure out which lines in /proc/net/udp are udp2log processes
[15:36:59] there are other ways to figure that out
[15:37:10] grepping for hex of listen port is one?
[15:42:07] ottomata: can't you shell out to lsof ?
[15:43:37] same deal i think, (lsof uses /proc???)
[15:43:38] udp2log 26970 udp2log NOFD /proc/26970/fd (opendir: Permission denied)
[15:44:11] ottomata: suid udp2log ?
[15:44:27] naw, udp2log doesn't have perms either
[15:44:28] just root
[15:44:34] idk
[15:44:43] ignore what it says about 500
[15:44:47] try it for real
[15:44:52] sudo su - udp2log
[15:45:06] (or leave out the sudo if you're already root!)
[15:45:55] oh, hrmmm
[15:46:05] naw, even as udp2log user
[15:46:06] udp2log 26970 udp2log NOFD /proc/26970/fd (opendir: Permission denied)
[15:46:23] grrr
[15:46:24] it looks like lsof reads /proc/.../fd to figure its stuff out
[15:46:24] so
[15:46:32] also, netstat wants me to be root too :p
[15:51:06] i think i'm going to have to do this:
[15:51:18] grep either ps or udp2log init.d file for the port that udp2log is listening on
[15:51:25] convert port to hex
[15:51:31] grep /proc/net/udp for hex port
[15:55:47] New patchset: Jeremyb; "one more script to use /var/run/icinga" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56604
[15:56:21] New review: Dzahn; "it is an alias. i added it, to keep things generic" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56562
[15:58:16] * jeremyb_ repokes about https://rt.wikimedia.org/Ticket/Display.html?id=4761
[16:00:08] jeremyb_: thank you
[16:01:28] New review: Jeremyb; "fu Iaf9dcb9ab7574ce79c74f559c48d0c1d31fb51be" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56601
[16:06:01] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[16:06:47] greg-g: did you get sorted??
[16:07:10] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[16:07:10] jeremyb_: didn't work last I checked yesterday, let's see if it is a midnight cronjob ;)
[16:07:14] * jeremyb_ wonders again about reopening graphite (or adding a group or something)
[16:07:22] greg-g: i don't think it is
[16:07:36] :) nope
[16:07:40] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 27384 MB (3% inode=99%):
[16:08:00] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 16:07:57 UTC 2013
[16:08:37] mutante: if I can bug you again about my credentials... graphite still won't let me in. Also, relatedly, robla and I agree it'd be good to get me a RT account/access. Should I email rt these issues? :)
[16:09:00] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[16:09:10] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 16:09:02 UTC 2013
[16:10:00] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[16:10:10] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 16:10:01 UTC 2013
[16:10:57] greg-g: i don't know about graphite, but i do about RT, you already have an account, created yesterday:) PMing you details
[16:11:00] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[16:11:00] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 16:10:55 UTC 2013
[16:11:08] oh, nice!
[16:12:00] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[16:12:25] heh
[16:12:30] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 16:12:22 UTC 2013
[16:13:00] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[16:13:00] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 16:12:55 UTC 2013
[16:14:00] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[16:14:40] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 16:14:32 UTC 2013
[16:15:00] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[16:16:02] jeremyb_: what was that other site you had me test yesterday?
[16:16:15] icinga-admin/ishmael
[16:16:19] ishmael, right
[16:16:20] thanks
[16:16:45] New patchset: Lcarr; "fixing pid file" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56605
[16:18:22] * greg-g made his first ever RT ticket
[16:24:22] Change merged: Lcarr; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56605
[16:24:39] greg-g, congratulations. next step would be to receive a notification from a ticket you can't see:P
[16:24:54] the way to nirvana is long
[16:26:38] MaxSem: sounds... wonderful.
[16:28:15] * jeremyb_ stabs LeslieCarr!
[16:28:22] ow!
[16:28:30] hi, who should I talk to about temporarily getting a new public SSH key added to stat1? my laptop is dead and I'd like to be able to access stat1 before it's fixed
[16:28:40] LeslieCarr: https://gerrit.wikimedia.org/r/56604
[16:28:54] jgonera: put in an rt ticket
[16:29:15] hashar: i also made the change and committed it
[16:29:16] hehe
[16:30:35] LeslieCarr, how do I register in rt?
[16:30:43] jgonera, if your laptop is in repair, all keys you have on it should be eternally invalidated
[16:30:52] jgonera: just send an email to ops-requests@rt.wikimedia.org
[16:31:05] MaxSem, that's actually a good point
[16:31:07] jgonera: don't need to register
[16:31:12] oh, ok
[16:31:19] New review: Hashar; "(2 comments)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/56348
[16:31:34] jgonera: once the mail is received you can then do a password reset @ rt.wikimedia.org
[16:33:06] yay i like it when jeremyb_ answers questions :)
[16:33:34] ++:)
[16:33:41] hehe
[16:36:48] New review: Hashar; "Sorry. I was referring to the JJB job `operations-puppet-doc ` which is https://gerrit.wikimedia.org..." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/53958
[16:36:54] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55872
[16:39:03] weird (I finally cleaned out all of the Ops Gerrit messages so now I'm seeing) the svn messages are html-only email? That seems odd to me.
[16:53:38] New patchset: Hashar; "contint: Move docs.pp into contint" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53958
[16:53:51] New review: Hashar; "manual rebase" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53958
[16:56:05] New review: Demon; "This is ready now." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/53759
[16:56:34] New review: Demon; "recheck" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54989
[16:59:57] PROBLEM - Puppet freshness on virt3 is CRITICAL: Puppet has not run in the last 10 hours
[17:00:15] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53759
[17:00:53] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54989
[17:01:14] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55057
[17:02:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:02:36] greg-g: i don't follow
[17:02:44] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/55259
[17:03:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 1.037 second response time
[17:03:28] jeremyb_: remove the bits between parens, the emails from svn@sockpuppet are html only, it seems
[17:03:44] yeah "svn messages" was a little vague
[17:04:48] Change merged: Ryan Lane; [operations/debs/gerrit] (master) - https://gerrit.wikimedia.org/r/56485
[17:05:28] New patchset: Hashar; "contint: puppet doc now handled by Jenkins" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53958
[17:05:48] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours
[17:07:58] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1
[17:08:28] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 26587 MB (3% inode=99%):
[17:08:31] jeremyb_: the part between parens was just me saying how I am finally cleaning up the ops mailing list mail I get that is just noise to me (I created an OpsNoise folder for gerrit, puppet, amanda, mscloud messages ;-) )
[17:08:50] greg-g: oh, i've never gotten mail from sockpuppet nor stafford so idk :)
[17:08:58] greg-g: do you get stafford too?
[17:09:13] greg-g: you don't even get the really noisy stuff ;)
[17:09:47] cronjob every minute!
[17:10:03] greg-g: i don't see how gerrit is opsnoise
[17:10:11] i wonder what mscloud is
[17:10:57] microsoft cloud!
[17:11:05] New review: Hashar; "The first succesful jenkins job is https://integration.wikimedia.org/ci/view/Operations/job/operatio..." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53958
[17:11:51] but why is it used? :)
[17:12:01] I've no idea if it is that
[17:12:12] Nimsoft Cloud Monitor
[17:12:29] <^demon> Replacing etherpad with ms word...in the cloud?
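
ottomata's three-step plan at [15:51] (find the listen port, convert it to hex, grep /proc/net/udp) is what his later patch title describes ("Looking up udp2log socket stats by port instead of socket inode"), and unlike /proc/<pid>/fd it needs no special permissions, since /proc/net/udp is world-readable. A sketch of the idea, assuming a 2.6.27+ kernel where the last /proc/net/udp column is the per-socket drop counter:

    def udp_socket_stats(port):
        """Sum (rx_queue, drops) for UDP sockets bound to `port`, by
        scanning /proc/net/udp; no root needed, unlike /proc/<pid>/fd."""
        hex_port = '%04X' % port  # /proc uses uppercase hex
        rx_queue = drops = 0
        with open('/proc/net/udp') as f:
            next(f)  # skip the header line
            for line in f:
                fields = line.split()
                local = fields[1]  # formatted as 'HEXIP:HEXPORT'
                if local.split(':')[1] != hex_port:
                    continue
                # fields[4] is 'tx_queue:rx_queue', both hex
                rx_queue += int(fields[4].split(':')[1], 16)
                drops += int(fields[-1])  # last column: drops (2.6.27+)
        return rx_queue, drops

For example, udp_socket_stats(8420) would match the line whose local address ends in ':20E4'; the udp2log port itself would come from the init script or ps, as ottomata says above.
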
[17:12:32] used for Icinga stuff [17:12:32] sounds more feasible :p [17:12:35] hah [17:12:42] marktraceur: see ^demon ^ [17:12:46] <^demon> Nimsoft != icinga [17:12:49] ^demon: Office 365 [17:12:53] <^demon> icinga == nagios replacement. [17:13:05] <^demon> nimsoft == site formerly known as watchmouse, does third-party monitoring [17:13:09] Thehelpfulone: yeah, no connection to icinga. i hope. if there's a connection than that's a bug [17:13:18] (besides that nimsoft monitors icinga) [17:13:25] monitor the monitoring [17:13:28] test the tests [17:13:32] <^demon> monitor all the monitors? [17:13:53] LeslieCarr: just wondering if you saw this... (it's your week) * jeremyb_ repokes about https://rt.wikimedia.org/Ticket/Display.html?id=4761 [17:14:33] jeremyb_: well, the gerrit messages sent to ops is noise for me, there's a ton of it and I don't need to read it [17:14:44] reedy and ^demon got your other questions :) [17:14:56] greg-g: The commits to operations/puppet? [17:15:06] There are some svn commits for DNS entries [17:15:15] cause that's still oldschool [17:15:21] greg-g: errr, ops gets gerrit on list in addition to personal subscriptions? weird [17:15:22] Reedy: yeah [17:15:31] jeremyb_: yeah, I thought so too ;) [17:15:40] It's how it's always been [17:15:48] ^demon: but doesn't tell us what the problem is when we're running at 60% in france :( [17:15:51] I'm guessing it's wanted, as otherwise someone would've complained and/or fixed it [17:16:21] Reedy: yeah, I mean, it probably makes sense for their use case, but not for platform's [17:16:29] Tough, it's their mailing list? ;) [17:16:31] <^demon> I'm curious if it's actually all that wanted. [17:17:07] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/53958 [17:17:12] I should go back inside [17:17:32] i need better mail filtering [17:17:37] jeremyb_: imapfilter! [17:17:40] /dev/null WFM [17:17:43] heh [17:17:52] 0 inbox too [17:17:54] Bonus! [17:18:07] #inboxzeroalldaymuthereffer [17:18:10] <^demon> greg-g: Nobody's deploying anything right now are they? [17:18:13] greg-g: do you use google apps? or the old setup? [17:18:26] ^demon: it's friday, they shouldn't be [17:18:49] jeremyb_: my wmf email is hosted by google, yes, but I don't use the web interface [17:19:18] <^demon> greg-g: We've got a gerrit update ready to roll. Activity seems pretty slow, and this update's long overdue. Any objections? [17:19:22] greg-g: i mean there's wmf people that have imap not at google [17:19:29] ^ [17:19:40] < 5 I think [17:19:51] ^demon: none from me, gerrit's not WP... ;) [17:19:52] possible :) [17:20:12] yeah, I think all new people get google :/ [17:20:16] <^demon> sweet, here we go. [17:20:21] brb, real work for a bit [17:23:03] <^demon> !log gerrit offline for an update, brb [17:23:10] Logged the message, Master [17:25:16] marktraceur: http://blogs.law.harvard.edu/sj/2013/03/28/annotation-hacks-hypothesis-xxx-begins-to-converge/ [17:33:37] New patchset: Demon; "Need single quotes since ${name} is literal" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56612 [17:36:19] jeremyb_: Ha! I would love to give SJ his always-up always-reliable EPL instance. [17:38:11] New patchset: Ottomata; "Looking up udp2log socket stats by port instead of socket inode." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56614 [17:40:10] * jeremyb_ stabs gerrit [17:40:15] ^demon: are you done? 
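[An aside on the mail-filtering tangent above: imapfilter itself is configured in Lua, but the same "shunt the noisy senders into an OpsNoise folder" filter can be sketched with only the Python standard library. Host, credentials, folder name, and the sender list here are placeholders, not the real setup.]

    import imaplib

    # Placeholder values; the real host/credentials/folders are assumptions.
    HOST, USER, PASSWORD = 'imap.example.org', 'greg', 'secret'
    NOISE_SENDERS = ['gerrit', 'svn', 'amanda']
    DEST = 'OpsNoise'

    conn = imaplib.IMAP4_SSL(HOST)
    conn.login(USER, PASSWORD)
    conn.select('INBOX')
    for sender in NOISE_SENDERS:
        # Find messages from each noisy sender and move them out of the inbox.
        typ, data = conn.search(None, '(FROM "%s")' % sender)
        for num in data[0].split():
            conn.copy(num, DEST)
            conn.store(num, '+FLAGS', '\\Deleted')
    conn.expunge()
    conn.logout()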
[17:40:17] New patchset: Demon; "Need single quotes since ${name} is literal" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56612 [17:40:37] <^demon> Mostly. [17:41:43] Change abandoned: Demon; "The hell?" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56612 [17:42:17] New patchset: Aaron Schulz; "Re-enabled async upload for commons." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56615 [17:42:48] i wonder if you logged everyone out or just me [17:43:35] http://i.imgur.com/GGZWWKs.png :) [17:43:38] <^demon> Caches don't survive a schema upgrade. [17:44:07] PROBLEM - Host db29 is DOWN: PING CRITICAL - Packet loss = 100% [17:44:10] gerrit took a while to make up its mind whether I was logged in or not :) [17:44:54] heh [17:44:58] !log aaron synchronized wmf-config/InitialiseSettings.php 'Re-enabled async upload for commons' [17:45:05] Logged the message, Master [17:45:35] <^demon> Ugh. Things are merging but not saying they're merged. [17:45:52] I just noticed that [17:46:07] you mean in channel or even not saying it on the web? [17:46:09] what about email? [17:46:17] I saw "merge pending" but it pulled down and I was like "wahht?" [17:46:24] ;) [17:47:37] !log taking down db29 to replace bad hw [17:47:43] Logged the message, Master [17:48:34] <^demon> Ah, it's a plugin. [17:49:47] RECOVERY - Host db29 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms [17:51:05] marktraceur: so there is no way to re-upload using UW? [17:51:38] AaronSchulz: Not to my knowledge, no [17:51:45] gah [17:54:36] Reedy can you create the wiki for https://rt.wikimedia.org/Ticket/Display.html?id=4850 - I believe you've done the last few? [17:55:33] mutante: ^ Could you do the DNS and apache entries please? [17:57:48] <^demon> AaronSchulz: Ok, merging fixed. [17:59:32] New patchset: Demon; "I hate when people change parameter names" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56616 [17:59:43] ok folks, some raid troubleshooting going on, so doing it in here in case anyone is interested in how to do it [18:00:02] in this case, it's a sun server, so adaptec controller, and using arcconf [18:00:18] sbernardin: So basically, you just have to memorize what utilities run which controllers [18:00:29] or in the worst case, recall all the utilities and just try them [18:00:32] !log DNS update - adding transitionteam wiki [18:00:33] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56616 [18:00:40] cuz if you try to use megacli (what dell perc uses) on this, it simply won't work [18:00:43] Logged the message, Master [18:00:44] you won't actually kill something. [18:00:47] (usually) [18:00:52] what's this udp2log crap? [18:00:59] Reedy: DNS done .. Apache soon [18:01:02] Ok [18:01:11] ottomata: did you merge something and not merge it on sockpuppet? [18:01:20] so in this case: arcconf getconfig 1 [18:01:28] that shows you all the configuration on the controller [18:01:31] and lists every disk [18:01:38] you can type in just arcconf to get all the options [18:01:55] (i also cheat and google arcconf cheat sheet if i have to do things like swap disks without autorebuild) [18:02:17] sbernardin: So we can see the controller is ok, but the raid is degraded [18:02:27] and the 3rd disk (under logical device info) is missing [18:03:00] Ryan_Lane, maybe? gerrit was being real weird
[18:03:04] didn't look like it merged [18:03:07] but it is safe to merge [18:03:19] ah, yeah, refresh of gerrit shows it is merged now [18:03:20] sbernardin: So, i would copy down the output of that commend [18:03:21] command [18:03:24] lemme know if you did [18:03:32] specifically what is listed under Logical device segment information [18:03:49] sbernardin: then, there should be a disk that isn't flashing its device io light when the others do [18:04:04] if you watch it for about 30 seconds, usually every disk will flash once [18:04:25] <^demon> !log gerrit back up, now running 2.6-rc0-76-g52fb5ae [18:04:27] then you should be able to see which is dead, pull it, and ensure its Serial Number is NOT one of the ones listed in the logical device segment [18:04:32] Logged the message, Master [18:04:38] sbernardin: but if you cannot tell by seeing which one doesn't flash, then we can do other things [18:04:40] but that's the easiest [18:04:43] OK [18:04:53] now, if the disk wasn't totally dead [18:04:59] we would be able to see its serial to confirm it when you remove [18:05:06] but since it is, we just have to confirm the serial ISN'T on our output list [18:05:10] which means we have the right disk [18:05:11] make sense? [18:05:29] hey guys, do I need approval/RT ticket to change someone's ssh key? [18:05:30] Reedy: any opinion on which Apache conf? main.conf vs. remnant.conf? both would work, other private wikis are in remnant, but if it's new, it can hardly be a remnant:) [18:05:55] Yup....got it [18:05:59] ottomata: we need some way to confirm it's really that person who is asking, so RT is usually first step [18:06:13] k [18:06:14] but it's a hard question, since without handoff, how are you sure it's their key ;] [18:06:38] if you can do an RT, you can detail how you know it's their key, etc. [18:06:50] ottomata: if you think you can ask them to do that, gpg signing is a really nice way to confirm [18:06:52] i'd put it in RT and then file a changeset and link it to the RT ticket personally [18:06:54] ah, RobH, there is an RT [18:06:54] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [18:06:54] https://rt.wikimedia.org/Ticket/Display.html?id=4854 [18:07:12] ok [18:07:24] ottomata: hrmm, i hope this dude makes this key not temp but just his key [18:07:32] but whatevs, hrmm [18:07:40] yeah, his most recent email (i'll paste in RT) says he's going to keep it [18:07:56] ottomata: it's just a little annoying to have to check to make sure it isn't malicious [18:08:11] ottomata: So yea, if you know this is legit, because you have had long conversations with him on other things to prove it's him [18:08:15] (see what i mean?) [18:08:20] then you should comment in ticket that it's legit [18:08:25] (from your perspective) [18:08:30] because none of us know this dude. [18:08:40] (no offense to him intended) [18:09:04] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [18:09:05] hm, i guess dario should comment then?
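[A rough sketch of automating the check RobH walks through above: run arcconf against the controller and pull out the logical-device status plus the serial numbers of disks still in the array (a pulled disk's serial must NOT appear in that list). The exact "Status of logical device" / "Serial number" label text varies by arcconf version and firmware, so treat the string matching as illustrative.]

    import subprocess

    def degraded_devices(controller=1):
        """Parse `arcconf getconfig <controller>` output and return
        (logical device status lines, serial numbers of present disks)."""
        out = subprocess.check_output(['arcconf', 'getconfig', str(controller)])
        statuses, serials = [], []
        for line in out.splitlines():
            line = line.strip()
            # Label text is an assumption; adjust to the actual output.
            if line.lower().startswith('status of logical device'):
                statuses.append(line)
            elif line.lower().startswith('serial number'):
                serials.append(line.split(':', 1)[1].strip())
        return statuses, serials

    if __name__ == '__main__':
        statuses, serials = degraded_devices(1)
        print(statuses)
        print(serials)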
[18:09:10] I don't know him either [18:09:19] well, someone who knows that it is a legit key for him should [18:09:33] better off would be as mutante says and sign keys [18:09:33] k [18:09:34] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 27073 MB (3% inode=99%): [18:09:43] but at minimum we want a social check in place [18:09:47] (not great, but meh) [18:12:51] hmk [18:13:06] New patchset: Ottomata; "Removing debug statement" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56617 [18:13:50] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56617 [18:20:01] RobH: I am almost afraid to ask but what can we do to bring an07 (that cisco box) to life? [18:20:12] anybody avail to help me figure out why my new ganglia stats aren't working? [18:20:19] i've got this error message on ganglia-monitor startup [18:20:25] Unable to find the metric information for 'rx_queue'. Possible that the module has not been loaded.#012 [18:20:31] but I can run the python module as ganglia user just fine now [18:21:13] drdee: Someone needs to troubleshoot the installer and see why it's failing to make a raid array [18:21:28] cuz something isn't right, and it should be identical to the rest [18:22:01] drdee: if analytics doesn't have someone who can do this, we can find an ops person [18:22:12] which i'm totally going to make cmjohnson1's problem. [18:22:19] cmjohnson1: this is what you get for wanting to learn. [18:22:46] robh: yeah..i was hoping to have a box i could take a raid controller from to test it first [18:23:05] the raid isn't doing anything [18:23:05] it's software raid in the installer [18:23:11] so i think yer gonna have to troubleshoot in place [18:23:21] we don't really know why it's failing [18:23:24] binasher: around? do you need help diagnosing that jobqueue craziness that happened yesterday? We have a wikidata deploy next week on Wed on enwp, and, well, would love to know if we need to fix something before that ;) [18:23:26] i wonder if i can use the one from labsdb1003 since it's not in service yet (binasher) [18:23:27] someone needs to parse the installer logs manually and read them [18:23:31] and it's painfully boring [18:23:40] but yea, you can try if you want [18:23:52] RobH: can you email me the installer logs? [18:23:55] this is me, handing you the issue, to correct as you see fit [18:24:01] drdee: well, cmjohnson1 can [18:24:04] cuz now, it's his problem [18:24:08] * RobH runs away laughing evilly [18:24:13] nicely done [18:24:20] * cmjohnson1 filling with h8 [18:24:25] no srsly, i am having to stifle evil laughter. [18:24:36] i can feel chris's hate from across the country [18:24:37] cmjohnson1 feel free to email me the installer logs [18:24:41] an1007 is my dataset1 [18:25:02] cmjohnson1: you know the usual, you can ping me to help, or anyone else you need [18:25:08] i'm just making sure you know yer the point person is all [18:25:14] uh huh..i know [18:25:15] thx [18:25:37] cmjohnson1: your hate for me is like sunshine, i revel in it. [18:25:57] :-P [18:26:19] ssl3004? [18:26:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:28:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time [18:28:50] greg-g: can we drop to running a single wikidata change dispatcher and also reduce the batch size? this only has to be for a week or two, once the jobqueue is moved to redis, the problem goes away
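[On the gpg-signing suggestion above: one way to script that check, assuming the requester clearsigns the new public key with a GPG key whose fingerprint you already trust. This uses the third-party python-gnupg package; the filename and fingerprint are placeholders.]

    import gnupg

    TRUSTED_FPR = '0123456789ABCDEF0123456789ABCDEF01234567'  # placeholder

    gpg = gnupg.GPG()
    with open('new_key.pub.asc', 'rb') as f:
        verified = gpg.verify_file(f)
    # Only accept the key if the signature is valid AND was made by the
    # fingerprint we already associate with this person.
    if verified.valid and verified.fingerprint == TRUSTED_FPR:
        print('signature checks out; key swap can proceed')
    else:
        print('refusing: signature missing or from an unknown key')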
[18:29:15] New review: Ryan Lane; "(1 comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/56104 [18:29:36] ^demon: ^^ [18:29:43] tin can't connect to irc, it's an internal host [18:30:34] I made git-deploy write its log messages into the redis queue on tin, and was going to add support to ircecho to pull from the queue [18:31:08] <^demon> Hmm. [18:31:13] binasher: could be reasonable, I'd just double check with the wikidata people [18:31:32] it should be amazingly easy to add. I can probably do so today, if needed [18:32:29] could probably made ircecho push into a queue as well [18:32:31] *make [18:32:35] <^demon> *nod* [18:32:48] then tin would read the file, push in, and another system would pop and write into irc [18:33:19] alternatively, we could have ircecho listen and repeat [18:33:31] I used the queue because I was already using it [18:34:18] listen + repeat is likely less reliable [18:34:36] cmjohnson1: it's ok if labsdb1003 is down for a bit [18:34:57] ori-l, around? [18:34:57] cool thx ...that helps [18:35:06] hey [18:35:09] ottomata: ^ [18:35:32] did you have any success sending your udp2log stats to ganglia? [18:35:46] ^demon: thoughts? [18:35:58] ^demon: any approach you think is better? [18:36:30] I know we had some udp listener for a while [18:37:03] <^demon> For listen + repeat, where would you repeat from? [18:37:10] ottomata: http://ganglia.wikimedia.org/latest/graph.php?r=hour&z=xlarge&title=EventLogging&vl=events+%2F+sec&x=&n=&hreg[]=vanadium.eqiad.wmnet&mreg[]=%5E%28client-generated-raw%7Cserver-generated-raw%7Cvalid-events%29%24&gtype=stack&glegend=show&aggregate=1 [18:37:23] some host with a public IP [18:37:26] binasher: I just asked in #wikimedia-wikidata and the initial response is "lets not do that unless we really have to" basically. Can you email the wikidata mailing list about it? [18:37:35] the same will be needed with ircecho pulling from a queue [18:37:50] <^demon> Ryan_Lane: Figured. I suppose that's probably the path of least resistance. [18:37:54] ottomata: happy to take a look and help debug; what's the issue? [18:38:01] what, listener? [18:38:10] if we already have a listener written, yeah [18:38:11] <^demon> Yeah [18:38:14] <^demon> In fact, depending on how we set it up it could be a general service for "Inside things that want to post crap to IRC" [18:38:21] ori-l, doh, i may have just found it, one sec [18:38:56] greg-g: maybe.. let's see if we can avoid it by other means. aaron deployed a change last night that throttles how quickly the wikidata change jobs get resolved into refreshLinks jobs but it's unclear if that's enough [18:39:00] <^demon> brb [18:39:01] ^demon: I know we had one for a while. not sure what happened to it [18:39:16] RoanKattouw_away: any idea what happened to that udp listener that talked to irc? [18:39:41] if it is, great. if it's not, we'll cut the wikidata dispatch capacity and then notify them [18:39:50] yeahhh, totally something dumb on my part, i just saw it. i renamed the module but didn't change it in the .pyconf file. [18:39:54] but, ori-l, while we're talking about it [18:39:55] my only issue with a listener is that when it's down all messages are dropped [18:40:00] binasher: gotcha. thanks. So, in your opinion (I'll consider you 'owning' this issue right now) should we continue as planned with the deploy on wed and just monitor, with no more changes between now and then?
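[The consumer side of the queue-backed relay Ryan_Lane describes above could look roughly like this — a sketch using the redis-py client and a bare-socket IRC connection. Queue name, host names, nick, and channel are all made up; a real relay would also need to answer server PINGs and wait for registration to finish.]

    import socket
    import redis

    QUEUE = 'deploy-log'                           # assumed queue name
    r = redis.StrictRedis(host='tin.eqiad.wmnet')  # hostname is illustrative

    irc = socket.create_connection(('irc.example.org', 6667))
    irc.sendall('NICK logbot\r\nUSER logbot 0 * :logbot\r\n')
    irc.sendall('JOIN #wikimedia-operations\r\n')

    while True:
        # BLPOP blocks until a message is queued; because messages sit in
        # redis while the bot is down, nothing is lost across restarts --
        # the advantage over listen + repeat noted above.
        _, msg = r.blpop(QUEUE)
        irc.sendall('PRIVMSG #wikimedia-operations :%s\r\n' % msg)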
[18:40:07] I had to modify the udp2log module to not use /proc/<pid>/fd [18:40:14] using the fd and looking up the inodes is way more elegant [18:40:17] if we write into a queue, when the bot comes back up, it'll log things that happened while it was down [18:40:19] greg-g: what are the implications of the deploy [18:40:19] but ganglia user doesn't have permissions to read that [18:40:50] instead i'm looking up the udp2log listen port in the cmdline, and using that to get data out of /proc/net/udp [18:40:55] works, but is much less elegant [18:40:58] and more udp2log specific [18:41:02] binasher: many more things getting in the queue, so if the queue is slow, many users not getting notifications [18:41:07] also, we can combine ircecho and adminbot functions, so, if someone writes into a specific queue, it'll write directly to wikitech, rather than spitting a message into irc [18:41:27] I hate that we have a bot that logs to a bot. it's dirty [18:41:36] and noisy [18:41:47] New patchset: Ottomata; "Fixing .pyconf module name" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56621 [18:41:58] greg-g: getting things in the queue caused two site outages in the last 24 hours [18:42:01] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56621 [18:42:36] binasher: oh, I didn't see your email until just now.... [18:42:47] * greg-g reads [18:42:53] binasher: can this go to -wikidata list? [18:42:55] greg-g: ah, i thought that was why you asked :) [18:43:04] ottomata: hrm, dunno. i imagine it's the sort of thing for which there is some kind of best-practice permission scheme (giving non-root daemons the ability to poll system stats on /proc) [18:43:15] ask binasher :P [18:43:19] greg-g: certainly, would you mind forwarding? [18:43:23] will do [18:43:43] yeah, it works for some of the files in /proc/<pid>/, just not the /proc/<pid>/fd directory [18:44:07] dr-x------ 2 root root 0 2013-03-28 20:25 /proc/26970/fd [18:44:13] MaxSem: why's it necessary to direct load.php to mobile? [18:44:15] * binasher needs to run to make a lunch meeting.. back later, then i can try to help you ottomata [18:44:23] mmk, no worries binasher, thanks! [18:44:35] so that it delivers a different module? [18:44:38] what I have works now, it's just not as elegant as ori-l's solution [18:44:51] Ryan_Lane, because X-Device is set only for mobile Varnishes [18:45:49] this is put bits content into the mobile varnish, right? [18:45:52] *will [18:46:52] yes, but only mobile load.php requests, which would take very little storage capacity [18:46:58] * Ryan_Lane nods [18:47:39] is it not possible to pass a param to load.php that bits can detect for delivering x-device? [18:47:51] either way we're going to need to reconfigure varnish [18:47:58] for either mobile or bits [18:49:27] !log hot-swapping disk on db29 [18:49:33] I'm just asking, really [18:49:34] Logged the message, Master [18:49:45] Ryan_Lane, eg s/mobile.device.detect/mobile.device.{$x0device}/ ? [18:49:46] my only real concern is troubleshooting [18:50:46] well, and splitting bits config into two spots [18:52:16] New patchset: Ottomata; "Adding class misc::monitoring::net::udp for generic udp statistics monitoring." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56623 [18:52:31] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56623 [18:52:41] MaxSem: you are delivering links to bits from the html. the html is mobile specific [18:53:04] yes...
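[Back to the udp2log stats thread above: the port-based workaround ottomata describes amounts to scanning /proc/net/udp for a socket whose local port matches. A minimal sketch, following the kernel's documented field layout (local_address is hex ip:port, the queue sizes are a hex tx:rx pair, drops is the last column); this is not the actual udp2log module, just the idea.]

    def udp_socket_stats(port):
        """Return (tx_queue, rx_queue, drops) for the UDP socket bound to
        the given local port, or None if no such socket exists. Needs no
        access to /proc/<pid>/fd, so it works as a non-root ganglia user."""
        with open('/proc/net/udp') as f:
            next(f)  # skip the header line
            for line in f:
                fields = line.split()
                local_port = int(fields[1].split(':')[1], 16)
                if local_port != port:
                    continue
                tx_hex, rx_hex = fields[4].split(':')
                drops = int(fields[-1])
                return int(tx_hex, 16), int(rx_hex, 16), drops
        return None

    print(udp_socket_stats(8420))  # 8420 is just an example port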
[18:53:24] so, it should be possible to tell it to do mobile detection [18:53:50] tell what? [18:54:26] bleh. let me reorganize my thoughts so I can ask more clearly [18:55:19] detecting device in load.php is easy, but it will not cache properly without X-Device header set in frontend varnish [18:55:34] cause it needs to vary by X-Device [18:56:42] does the module delivered need to be different for every device? [18:57:30] hi, could someone review https://gerrit.wikimedia.org/r/#/c/56333/ (a few more acls) and https://gerrit.wikimedia.org/r/#/c/55302/ ? [18:57:53] paravoid, i hope i addressed your questions in the email :) [18:58:16] Ryan_Lane, yes, that's the point of it [18:58:25] (I'm not actually suggesting a different approach, btw, I'm just trying to get a better understanding of how it's changing and why it's necessary) [18:58:39] ah. ok. that makes sense, then [18:59:00] yeah, mobile stuff is insane in places [18:59:04] heh [18:59:14] (in a lot of them) [19:00:10] cool. this change sounds really great, btw [19:00:32] good job tackling this problem [19:03:32] Ryan_Lane, wait until we pile up all the various banners for zero... talking of "insane" :) [19:06:14] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [19:06:55] RECOVERY - RAID on db29 is OK: OK: 1 logical device(s) checked [19:07:24] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [19:07:54] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 26606 MB (3% inode=99%): [19:23:06] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/51797 [19:24:48] !log aaron synchronized php-1.21wmf12/includes 'deployed fa5c0a6a82c9a1cb63f3cedd87e6c4aba59c994d' [19:24:49] New patchset: Demon; "Show notice to users who are using legacy skins" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56408 [19:24:55] Logged the message, Master [19:25:29] New review: Demon; "(5 comments)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56408 [19:25:38] are there any bugzilla admins awake? [19:25:56] I'd like to make a canned query publicly accessible [19:26:20] New patchset: Dzahn; "add transitionteam wiki conf - RT-4850" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/56631 [19:29:28] gwicke: ? [19:29:31] !log creating transitionteam docroot [19:29:38] Logged the message, Master [19:30:02] Reedy: is there a way to make a canned ('saved') query in bugzilla publicly available? [19:30:03] !log dzahn synchronized ./docroot/transitionteam [19:30:05] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/56631 [19:30:09] Logged the message, Master [19:30:14] gwicke: https://bugzilla.wikimedia.org/userprefs.cgi?tab=saved-searches [19:30:26] Share with group bz_canusewhines [19:30:36] Or do you not have that shown? [19:30:50] ah, nice! [19:30:53] thanks for that tip! [19:32:10] dzahn is doing a graceful restart of all apaches [19:32:34] Reedy: it does not seem to be available to ordinary users still, much less publicly [19:32:46] oh well, will have to paste long urls instead I guess [19:32:52] !log dzahn gracefulled all apaches [19:32:59] Logged the message, Master [19:33:28] Reedy: DNS and Apache done for new wiki, incl. docroot
[19:33:45] just mkdir and sync-common-file [19:35:33] That's great, thanks [19:37:42] yw [19:47:30] ^demon: http://ganglia.wikimedia.org/latest/?c=Miscellaneous%20pmtpa&h=professor.pmtpa.wmnet&m=cpu_report&r=hour&s=descending&hc=4&mc=2 [19:47:41] packets received & bytes received :) [19:48:22] <^demon> Ha, great. [19:59:22] New patchset: Diederik; "Dropping all search and preview url's" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56633 [20:02:12] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56633 [20:04:18] ori-l, your stuff works great! [20:04:21] :D :D :D [20:04:24] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [20:04:47] pure uncut colombian [20:04:54] haha [20:04:56] lol [20:05:06] ottomata: easy [20:05:16] so, i haven't done too much ganglia stuff, but! [20:05:17] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20eqiad&h=oxygen.wikimedia.org&v=390342&m=SndbufErrors&r=hour&z=default&jr=&js=&st=1364583635&vl=packets&ti=UDP%20Send%20Buffer%20Errors&z=large [20:05:30] do you think delta would be more useful? [20:05:35] send buffer errors / sec? [20:05:40] etc.? [20:06:12] hrm, yeah, that looks slightly misconfigured [20:06:15] probably my fault [20:06:24] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:06:25] well, it just reports the current value [20:06:27] ganglia only shows rate over time [20:06:30] which is correct, so that's ok [20:06:34] oh, hm [20:06:34] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [20:06:53] where's your config (and which script are you using)? [20:07:04] it's in puppet right now, but it's the same as yours was, with added metrics [20:07:13] db11 just doesn't want to die [20:07:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [20:07:16] files/ganglia/plugins/udp_stats.py [20:08:04] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 20:07:56 UTC 2013 [20:08:24] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [20:09:14] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 20:09:03 UTC 2013 [20:09:23] * ori-l looks [20:09:24] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [20:10:15] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 20:10:04 UTC 2013 [20:10:24] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [20:11:04] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 20:10:57 UTC 2013 [20:11:24] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [20:11:45] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 20:11:43 UTC 2013 [20:12:24] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [20:12:34] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 20:12:24 UTC 2013 [20:13:24] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [20:13:52] !log scheduling a 2-year downtime for db11 :p [20:13:59] Logged the message, Master [20:14:44] RECOVERY - Puppet freshness on db11 is OK: puppet ran at Fri Mar 29 20:14:33 UTC 2013 [20:15:10] arr, whatever
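[For context on where a counter like SndbufErrors comes from: a plugin such as the udp_stats.py mentioned above would typically read the kernel's whole-host UDP counters out of /proc/net/snmp, which carries a header line and a values line for each protocol. A sketch of that parse — not the actual module:]

    def read_udp_counters():
        """Parse the two 'Udp:' lines of /proc/net/snmp into a dict,
        e.g. {'InDatagrams': ..., 'InErrors': ..., 'SndbufErrors': ...}.
        These values are monotonically increasing counters since boot."""
        with open('/proc/net/snmp') as f:
            lines = [l.split() for l in f if l.startswith('Udp:')]
        header, values = lines[0][1:], lines[1][1:]
        return dict(zip(header, (int(v) for v in values)))

    print(read_udp_counters())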
[20:15:22] ottomata: yep, i need to fix it. just a moment [20:17:33] ooook, cool [20:17:33] danke [20:17:41] check out udp2log_socket.py as well then [20:17:48] i think it has the same need [20:20:24] well, ok, so let's think about this. I think it's not strictly-speaking incorrect. you're right that it just reports the value. but it's not a very good fit for the metric. [20:20:55] because suppose there's some weird udp issue tonight that causes 100k missed packets. your graph will just permanently rise by 100k [20:21:04] well, until you reboot or whatever [20:21:19] right [20:21:30] New review: Yurik; "(1 comment)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52606 [20:21:33] since these are just counters, i think the delta should just be reported, right? [20:21:40] the ganglia docs are absolutely terrible on this particular point, but I think counter type should be 'positive' [20:21:43] that should be the only fix you need [20:22:02] New patchset: Dzahn; "old key of jgonera: ensure => absent" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56638 [20:23:02] i'll make a commit [20:23:09] i can try real quick to test first, if you want [20:23:12] got it on a labs instance [20:24:03] naw, i'm sure this is right [20:24:13] ok cool [20:24:15] dooo oit [20:24:44] i love the ganglia docs (http://sourceforge.net/apps/trac/ganglia/wiki/ganglia_gmond_python_modules): [20:24:46] "The exact nature of this element is unclear, as is its relationship to the 'collect_every' configuration directive in your pyconf for the module. For all intents and purposes, this element seems... useless. " [20:26:49] New patchset: Ori.livneh; "Change slope to "positive" for UDP metrics." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56639 [20:27:09] haha [20:27:36] hmm, not all are counters though [20:27:42] udp2log ReceiveQueue varies [20:27:55] it should probably be an exception, right? [20:27:56] New patchset: Krinkle; "gerrit: Change repeat from 8-40 to 7-40 for hash in commentlink" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56640 [20:28:11] oh, that's a good point [20:30:13] so let's see [20:30:43] so udp2log_socket.py was actually correct [20:30:48] because it already special-cases it [20:30:54] oh [20:30:56] ok cool [20:31:09] er, wait [20:31:13] nice yeah i see [20:31:16] drops is positive [20:31:17] that's right [20:31:23] no wait [20:31:34] New review: Dzahn; "RT-4854" [operations/puppet] (production) C: 2; - https://gerrit.wikimedia.org/r/56638 [20:31:35] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56638 [20:31:36] yeha [20:31:37] yeah [20:31:44] rx_queue and tx_queue are not just counters, they vary [20:31:57] they are set to 'both', and drops is special-cased to 'positive' [20:32:38] New patchset: Ori.livneh; "Change slope to "positive" for UDP metrics." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56639 [20:35:43] cool, danke, merging [20:35:51] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56639 [20:36:47] fyi: puppet started mongodb on stat1 [20:37:04] hipster puppet [20:37:51] mutante, it was running before, no? [20:39:48] ottomata: hmm, it looked like no: ensure changed 'stopped' to 'running' [20:40:12] i just ran puppet for an unrelated thing, ensure an old key is absent and noticed that [20:40:39] ori-l: PBRuppet ?
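[What the slope fix amounts to inside a gmond python module: with slope 'positive', ganglia graphs a raw counter's rate of change rather than its ever-growing absolute value, while queue depths that rise and fall keep 'both'. A trimmed sketch of the descriptor shape — metric names and values here are illustrative, not the actual udp2log_socket.py.]

    def metric_handler(name):
        # Stub: a real module would read the current value for `name`
        # from /proc here (see the /proc sketches earlier in the log).
        return 0

    def metric_init(params):
        common = {
            'call_back': metric_handler,
            'time_max': 60,
            'value_type': 'uint',
            'units': 'packets',
            'format': '%d',
            'description': 'udp metric',
            'groups': 'udp',
        }
        descriptors = []
        # Raw kernel counters only ever increase: 'positive' makes ganglia
        # store and graph the delta per second instead of the total.
        for name in ('InErrors', 'SndbufErrors', 'drops'):
            descriptors.append(dict(common, name=name, slope='positive'))
        # Queue depths vary up and down, so they stay slope 'both'.
        for name in ('rx_queue', 'tx_queue'):
            descriptors.append(dict(common, name=name, slope='both',
                                    units='bytes'))
        return descriptors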
[20:41:09] hm, maybe it's one of those where puppet doesn't know how to tell if the proc is running [20:41:40] oh yea, possible [20:42:46] yeah [20:42:50] i just ran puppet and it said the same [20:42:56] udp2log does that too :/ [20:43:56] gotcha, ok [20:46:10] yeah, the upstart config is wrong [20:46:37] doesn't know about the pid file specified in /etc/mongodb.conf (which is /var/run/mongodb/mongod.pid) [20:54:18] New patchset: Ottomata; "Re-enabling htaccess on metrics.wikimedia.org" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56691 [20:55:14] ori-l hmm, seems like positive didn't change anything [20:55:20] http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Miscellaneous%20eqiad&h=oxygen.wikimedia.org&v=390342&m=SndbufErrors&r=hour&z=default&jr=&js=&st=1364583635&vl=packets&ti=UDP%20Send%20Buffer%20Errors&z=large [20:56:23] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56691 [20:56:27] ottomata: give it time. i don't recall specifics but ganglia isn't very graceful about handling metric changes [20:57:08] hm ok [20:57:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:00:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.123 second response time [21:02:03] !log jenkins: raised the number of executors for deployment-bastion from 2 to 4. Will let us execute more jobs in parallel [21:02:09] Logged the message, Master [21:02:59] New patchset: Ori.livneh; "Puppetize supervisor configs for EventLogging" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56692 [21:04:53] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/52606 [21:05:55] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [21:07:05] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [21:07:35] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 25567 MB (3% inode=99%): [21:10:01] New review: PleaseStand; "(1 comment)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56408 [21:10:24] binasher: thanks for merging https://gerrit.wikimedia.org/r/52606! that will find its way into prod sometime in the next 30 minutes or so, correct? [21:13:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:14:16] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [21:20:45] binasher, hi, could you review the other two patches for the same file please? It seems there are no objections to them either. [21:21:00] https://gerrit.wikimedia.org/r/#/c/56333/ (a few more acls) and https://gerrit.wikimedia.org/r/#/c/55302/ [21:21:34] New patchset: Ori.livneh; "upstart: pass '--pid-file' param to udp2log" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56693 [21:21:40] ^ ottomata [21:22:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:23:03] coool! thanks
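[The mongodb and udp2log symptoms above are the same underlying problem: if the init/upstart job doesn't know the daemon's real pid file, the "is it running?" status check fails and puppet keeps "starting" a running service. The usual liveness check, sketched in Python (path is the mongodb one quoted above):]

    import errno
    import os

    def pidfile_running(path='/var/run/mongodb/mongod.pid'):
        """True if the pid recorded in the pid file is a live process."""
        try:
            with open(path) as f:
                pid = int(f.read().strip())
        except (IOError, ValueError):
            return False          # no pid file, or garbage in it
        try:
            os.kill(pid, 0)       # signal 0: existence check, sends nothing
        except OSError as e:
            # EPERM means the process exists but belongs to another user.
            return e.errno == errno.EPERM
        return True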
[21:23:08] FYI, init.d != upstart :p [21:23:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [21:24:15] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56693 [21:24:43] er, right [21:24:46] binasher, ok to merge the X-Analytics stuff on sockpuppet? [21:28:21] awjr, the X-Analytics change isn't yet on the puppetmaster [21:28:27] i was about to merge another change and I see it waiting there [21:28:38] I don't want to touch it unless I get a real ok from binasher though :/ [21:28:45] ah ok, thanks for the heads-up [21:28:52] i didn't realize there was a separate step there [21:29:10] ottomata: can you ping me if/when that goes out? [21:29:30] sure, it will still take some minutes after it is merged in, for puppet to run on varnishes, so I can't give you the exact timing [21:29:38] for sure [21:29:49] brb [21:30:31] New patchset: awjrichards; "Enable X-Analytics logging in MobileFrontend" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56696 [21:31:10] New patchset: awjrichards; "Enable X-Analytics logging in MobileFrontend" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/56696 [21:32:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [21:33:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.143 second response time [21:36:38] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/54970 [21:39:17] PROBLEM - Puppet freshness on lvs1004 is CRITICAL: Puppet has not run in the last 10 hours [21:39:17] PROBLEM - Puppet freshness on lvs1005 is CRITICAL: Puppet has not run in the last 10 hours [21:39:17] PROBLEM - Puppet freshness on lvs1006 is CRITICAL: Puppet has not run in the last 10 hours [21:50:39] anybody here know how this was made? [21:50:39] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=v&vn=swift+frontend+proxies [21:50:45] I don't see a view .json file on nickel [21:50:59] so I imagine it was created with ganglia web gui somehow [21:51:01] I want to make one too! [22:05:18] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [22:07:28] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [22:07:58] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 25137 MB (3% inode=99%): [22:10:01] New patchset: awjrichards; "Support client-side caching for mobile" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56701 [22:11:08] PROBLEM - Puppet freshness on cp3010 is CRITICAL: Puppet has not run in the last 10 hours [22:18:36] heya, Ryan_Lane, where is ganglia.wmflabs.org hosted? [22:18:47] in the ganglia project [22:18:51] on aggregator1 [22:19:13] coool, i'm working on a ganglia::view define to more easily create ganglia views [22:19:18] aggregator2 is obviously the 2nd one :) [22:19:37] ganglia views? [22:19:38] hm, can you add me to the ganglia project? [22:19:40] yeah, like [22:19:45] yeah [22:19:45] http://ganglia.wikimedia.org/latest/?r=hour&cs=&ce=&tab=v&vn=swift+frontend+proxies [22:20:02] note that aggregator1 and aggregator2 are now too small and occasionally OOM [22:20:08] it's basically just json defining which hosts and metrics to show [22:20:10] hmm, k [22:20:15] ah. cool.
cool [22:20:18] http://sourceforge.net/apps/trac/ganglia/wiki/ganglia-web-2#Views [22:20:20] that's awesome [22:22:38] ja cool, just wanna test it in labs first, easier if I can test it on a running ganglia web instance, lemme know when i'm in the project [22:22:59] New review: MaxSem; "This overrides explicit no cache instructions from MW when it's really needed, e.g. on special pages." [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/56701 [22:23:08] PROBLEM - Puppet freshness on virt1005 is CRITICAL: Puppet has not run in the last 10 hours [22:23:26] what's your username? [22:23:37] ah Ottomata [22:23:45] yup [22:24:08] done [22:24:34] yay danke [22:25:37] !log reedy synchronized php-1.21wmf12/extensions/Wikibase [22:25:44] Logged the message, Master [22:33:55] Change abandoned: awjrichards; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56701 [22:36:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [22:36:53] Change abandoned: Jeremyb; "I2a9fbe5f7522ba9fed64415b5f7b230ee50cfc23 was merged instead" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56604 [22:37:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time [22:38:25] !log upgrading packages on aluminium [22:38:32] Logged the message, Mistress of the network gear. [22:52:05] no krinkle, no hashar :( [22:56:44] ganglia docs are so bad! [22:58:47] New patchset: QChris; "Reference used images by absolute path in gerrit's css" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56705 [22:59:12] New patchset: QChris; "Style gerrit's LDAP login page according to other gerrit pages" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/56706 [23:06:11] PROBLEM - Puppet freshness on db11 is CRITICAL: Puppet has not run in the last 10 hours [23:07:21] PROBLEM - RAID on db11 is CRITICAL: CRITICAL: Defunct disk drive count: 1 [23:07:51] PROBLEM - Disk space on db11 is CRITICAL: DISK CRITICAL - free space: /a 25704 MB (3% inode=99%): [23:10:31] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:11:21] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.129 second response time [23:16:49] New review: QChris; "(1 comment)" [operations/puppet] (production) C: -1; - https://gerrit.wikimedia.org/r/56640 [23:36:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [23:37:18] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.133 second response time [23:37:44] New review: Brion VIBBER; "Ok, this version looks fine to me. Faidon etc, what say you?" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/55302 [23:44:41] hrmm, zirconium has a higher load on the hour for its updates [23:44:50] i wonder if it could handle being our irc server for eqiad. [23:45:02] i guess i can do the includes and just not change dns and see what load is [23:45:18] (on monday, pushing anything this late on friday is horrible idea) [23:45:36] better than pushing it *after* we start drinking [23:45:39] it also technically should be moved into a role rather than just a misc class i suppsoe [23:45:50] i would have started, there are no proper tumblers for scotch [23:45:59] im a civilized man damn it. 
[23:46:18] drinking booze out of coffee cups is for the plebes! [23:46:31] we should bring in some proper glasses [23:49:32] brion: http://www.amazon.com/Sagaform-Rocking-Whiskey-Glasses-4-Ounces/dp/B001JANQRY/ref=sr_1_4?ie=UTF8&qid=1364600951&sr=8-4&keywords=scotch+tumblers [23:51:03] !log stopping mailman to remove sensitive message [23:51:09] Logged the message, Mistress of the network gear. [23:54:24] oh lord, removing messages from mailman [23:54:31] we have GOT to replace mailman with something from this century [23:56:48] brion: NEVA [23:56:54] mailman 4 life yo [23:57:08] we are also going to migrate away from google apps [23:57:20] and back to sendmail [23:57:21] and dns back to bind? [23:57:21] good old sendmail. [23:57:29] nah, who needs FQDN [23:57:50] we'll just use IP addresses