[00:15:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:29:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.712 seconds
[01:02:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:04:35] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[01:15:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.550 seconds
[01:42:14] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 275 seconds
[01:45:37] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 17 seconds
[01:52:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:00:35] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 283 seconds
[02:05:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.019 seconds
[02:28:01] !log LocalisationUpdate completed (1.21wmf3) at Mon Nov 5 02:28:01 UTC 2012
[02:28:09] Logged the message, Master
[02:38:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:39:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds
[02:51:13] !log LocalisationUpdate completed (1.21wmf2) at Mon Nov 5 02:51:13 UTC 2012
[02:51:22] Logged the message, Master
[03:59:07] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 6 seconds
[04:45:19] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[06:12:12] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours
[06:12:12] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[06:12:12] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours
[06:31:17] New review: jan; "You are right: Labsconsole does not include low-level-classes but it includes high-level-classes lik..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/29975
[06:44:56] New patchset: Jalexander; "switching wikidatawiki to noticeproject:wikimedia so that it doesn't get Wikipedia banners (like editor survey). Temp until new Wikidata category being added to extension soon" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31812
[07:23:21] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused
[07:44:54] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.004 second response time on port 11000
[07:55:08] New review: Hydriz; "Look good (and a nice idea too)." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/31812
[07:56:37] hello
[07:57:13] hi hashar
[07:57:56] oh man, operations/mediawiki-config has a ton of pending changes :-D
[07:58:04] guess I will deploy a bunch of them this morning
[07:58:39] thanks :)
[07:59:02] hashar: If you have time (and you can), do look at https://gerrit.wikimedia.org/r/#/q/status:open+project:mediawiki/extensions/WikimediaMaintenance,n,z
[07:59:02] I sent a patch to do some unit testing
[07:59:17] but including commonsettings.php several times does not play well :(
[08:03:14] New patchset: Hydriz; "(bug 41774) New wgImportSource for ro.wikibooks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31813
[08:04:48] and gerrit is giving 503 :(
[08:15:21] New patchset: Hydriz; "(bug 41774) New wgImportSource for ro.wikibooks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31813
[08:22:12] New patchset: Hashar; "(bug 40717) Namespace configuration for th.wiktionary and th.wikibooks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26449
[08:23:18] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[08:23:30] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26449
[08:24:31] !log hashar synchronized wmf-config/InitialiseSettings.php '(bug 40717) Namespace configuration for th.wiktionary and th.wikibooks'
[08:24:39] Logged the message, Master
[08:25:21] !log hashar synchronized wmf-config/InitialiseSettings.php '(bug 40717) Namespace configuration for th.wiktionary and th.wikibooks'
[08:25:26] Logged the message, Master
[08:38:52] New patchset: Hashar; "(bug 41167) Namespace configuration for ba.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28505
[08:39:20] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28505
[08:40:06] !log hashar synchronized wmf-config/InitialiseSettings.php '(bug 41167) Namespace configuration for ba.wikipedia'
[08:40:12] Logged the message, Master
[08:45:21] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31232
[08:46:20] !log hashar synchronized wmf-config/InitialiseSettings.php '(bug 41467) Create new namespace alias WP: for Norwegian (bokmål) Wikipedia'
[08:51:22] New patchset: Hashar; "(bug 32411) Transwiki import to multilingual wikisource broken" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31562
[08:52:35] New review: Hashar; "We could probably enable any sourcewiki. Anyway, lets try that change." [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/31562
[08:52:35] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31562
[08:53:12] !log hashar synchronized wmf-config/InitialiseSettings.php '(bug 32411) Transwiki import to multilingual wikisource broken'
[08:53:20] Logged the message, Master
[08:56:03] New patchset: Hashar; "(bug 40212) Mass update Wiktionary favicons" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31570
[08:56:59] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31570
[08:57:33] !log hashar synchronized wmf-config/InitialiseSettings.php '(bug 40212) Mass update Wiktionary favicons'
[08:57:41] Logged the message, Master
[08:59:27] New review: Hashar; "I would prefer we stop doing those "I fix whitespaces" changes. That is a lot of overhead just for a..." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/31205
[08:59:30] New patchset: Hashar; "Space attack, reduce. See I3aa4e3a3" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31205
[09:00:07] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31205
[09:00:43] !log hashar synchronized wmf-config/InitialiseSettings.php 'whitespace fix {{gerrit|31205}}'
[09:00:50] Logged the message, Master
[09:01:19] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31608
[09:01:59] !log hashar synchronized wmf-config/InitialiseSettings.php 'Configure Babel category for Wikidata {{gerrit|31608}}'
[09:02:05] Logged the message, Master
[09:02:37] New review: Hashar; "Deployed on live cluster." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31608
[09:07:43] New patchset: Hashar; "(bug 40212) Mass update Wiktionary favicons" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31584
[09:13:16] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31584
[09:14:33] New review: Hashar; "deployed live." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31584
[09:15:38] !log hashar synchronized wmf-config/InitialiseSettings.php '(bug 40212) Mass update Wiktionary favicons'
[09:15:40] Logged the message, Master
[09:16:36] New patchset: Hashar; "(bug 38134) Enable Extension:GoogleNewsSitemap on es wikinews" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30589
[09:17:09] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30589
[09:17:52] !log hashar synchronized wmf-config/InitialiseSettings.php '(bug 38134) Enable Extension:GoogleNewsSitemap on es wikinews'
[09:17:57] Logged the message, Master
[09:21:33] Hydriz: you rock :-]
[09:21:40] ?
[09:21:54] Hydriz: Enable Extension:GoogleNewsSitemap on es wikinews - https://bugzilla.wikimedia.org/38134
[09:21:57] Hydriz: isn't that you ?
[09:22:08] yep :)
[09:22:11] but what about it?
[09:22:31] well you did a lot of changes
[09:22:38] much appreciated :-]
[09:22:42] lolz
[09:22:46] I love how the community is kind of self healing the cluster ;-]
[09:22:58] * Hydriz calms his heart down after the jump
[09:23:06] :P
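The wmf-config changes hashar was merging above (namespace configuration, import sources, favicons) are per-wiki overrides in InitialiseSettings.php, keyed by database name. A hypothetical sketch of the shape such a namespace stanza takes; the wiki key, IDs and names are invented for illustration and are not the actual bug 40717 change:

```php
// Hypothetical per-wiki namespace stanza in wmf-config/InitialiseSettings.php.
// Keys are database names; custom namespaces conventionally start at ID 100,
// with the matching talk namespace at the next odd ID.
'wgExtraNamespaces' => array(
	'examplewiktionary' => array(
		100 => 'Appendix',
		101 => 'Appendix_talk',
	),
),
```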
[09:24:26] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: Puppet has not run in the last 10 hours
[09:25:32] New patchset: Hashar; "(bug 41774) New wgImportSource for ro.wikibooks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31813
[09:25:56] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31813
[09:26:22] !log hashar synchronized wmf-config/InitialiseSettings.php '(bug 41774) New wgImportSource for ro.wikibooks'
[09:26:25] merged in an hour, thanks :)
[09:26:30] Logged the message, Master
[09:28:12] Hydriz: monday is my 20% day
[09:28:26] Hydriz: and I usually spend the morning merging and deploying mediawiki-config changes
[09:28:26] o_O thats good
[09:35:02] New patchset: Hashar; "(bug 41717) Update default (language-neutral) Wiktionary logo" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31585
[09:35:44] New review: Hashar; "rebased and fixed conflict." [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/31585
[09:35:44] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31585
[09:36:15] !log hashar synchronized wmf-config/InitialiseSettings.php '(bug 41717) Update default (language-neutral) Wiktionary logo'
[09:36:20] Logged the message, Master
[09:40:08] now this may be language neutral, but it's kinda funny: https://be.wiktionary.org/wiki/%D0%9F%D0%B5%D1%80%D1%88%D0%B0%D1%8F_%D1%81%D1%82%D0%B0%D1%80%D0%BE%D0%BD%D0%BA%D0%B0
[09:43:03] New review: Hashar; "Some minor questions." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/31580
[09:44:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:45:03] New review: Hashar; "Commented on bug 41712 as well." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/31580
[09:45:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.235 seconds
[09:48:34] New patchset: Hashar; "wikidatawiki to use noticeproject::wikimedia banner" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31812
[09:48:56] New review: Hashar; "I have rephrased the commit message and rebased the changed." [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/31812
[09:48:56] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31812
[09:49:54] !log hashar synchronized wmf-config/InitialiseSettings.php 'wikidatawiki to use noticeproject::wikimedia banner'
[09:49:58] Logged the message, Master
[09:54:40] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[09:54:40] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[09:54:40] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[09:57:23] off for a few minutes
[10:09:18] New review: Nikerabbit; "It would take less time if bad whitespace were not merged in the first place thus requiring followup..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31205
[10:20:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:25:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.804 seconds
[10:26:25] paravoid: hi, could you log into tmh1 and pastebin or email me ps axu, am trying to figure out why no jobs are processed
[10:28:21] ps axu | grep apache might be enough to see if jobs-loop.sh is running and what state it is in
[10:37:24] j^:
[10:37:25] apache 7194 0.0 0.0 13440 5380 ? SN Oct23 2:29 /bin/bash /usr/local/bin/jobs-loop.sh -t 14400 -v 0 webVideoTranscode
[10:37:25] apache 23345 0.0 0.0 4308 356 ? SN 10:37 0:00 sleep 5
[10:37:29] only two lines of apache
[10:38:52] /usr/local/apache/common/multiversion is empty
[10:39:36] there should be MWScript.php and other things in it
[10:40:05] this is why jobs-loop.sh fails; it can never get a next db since it doesn't find MWScript.php
[10:47:49] apergos: do you know what puppet class creates /usr/local/apache/common/multiversion
[10:48:13] I believe there is a special sync script that populates it
[10:48:36] can you run it?
[10:48:48] let me see if I can find it
[10:49:58] ohh I can sync-dir, but can I do that for one host only? mm
[10:50:06] nope.
[10:50:13] * apergos does it the hard way
[10:53:09] oh heh
[10:53:16] /usr/local/apache/common is empty :-D
[10:54:34] on labs /usr/local/apache/common -> /usr/local/apache/common-local
[10:54:56] while all of /usr/local/apache is a symlink somewhere else (/data/project/apache)
[10:55:20] not sure how far this should/is the same in production
[10:55:37] tmh1/2 should have the same layout as jobrunners have
[10:55:48] yes, well the point is
[10:55:50] lrwxrwxrwx 1 root root 12 Oct 3 01:03 common -> common-local
[10:56:01] but this dir is empty instead of having nice yummy mw installations in it
[10:56:04] I'm doing the rsync now
[10:56:45] the jobrunners are like any other apache box, same setup
[10:57:04] (keeps it simpler to have one generic setup everywhere)
[10:58:04] if tmh1 is not in the mw installation hosts for dsh you will want to add it, and also to the job runners group. (I wonder if those are in puppet yet.)
[10:58:11] sounds good, just wondering why it was missing on tmh1/2 and how to make sure updates also get synced
[10:58:30] ah do I need to do tmh2 also?
[10:58:55] they have the same setup, so if tmh1 is broken i assume tmh2 is broken too
[10:59:00] tmh2 is even worse off
[10:59:20] * j^ looks how to add tmh1/2 to mw installation hosts for dsh
[11:00:37] * apergos adds the symlink for common-local to tmh2 and does the same rsync here
[11:00:37] *there
[11:01:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:02:47] with all files in place /etc/init.d/mw-job-runner restart is needed, cd happens before the loop
[11:03:03] yes indeed
[11:03:08] well I am waiting for the rsyncs to finish
[11:03:47] after that I'll try getting a next db from the command line and running a job
[11:03:52] if that works then I'll restart the script
[11:04:15] not able to find what needs to be done to have this happen automatically next time, should i file a bug in rt about it?
[11:04:43] (these take a while because they copy over all the branches of mw that we have on fenari which is about 4 right now)
[11:04:57] sure, if you file it that will highlight the problem we have keeping those files up to date
[11:05:41] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[11:09:14] ok rt ticket at https://rt.wikimedia.org/Ticket/Display.html?id=3861
[11:09:26] great, thnk you
[11:09:28] *thank
[11:12:06] I see transcode jobs running now
[11:12:10] tmh1
[11:12:47] same on tmh2
[11:13:02] I guess you are good to go for a little while
[11:13:09] nice, thanks a lot
[11:13:20] yw
[11:14:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.393 seconds
[11:35:01] PROBLEM - Frontend Squid HTTP on amssq54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:37:52] ah guess I should log all that
[11:38:10] RECOVERY - Frontend Squid HTTP on amssq54 is OK: HTTP OK HTTP/1.0 200 OK - 795 bytes in 0.235 seconds
[11:38:11] !log synced apache/common-local by hand to tmh1 and tmh2 and restart job runners on both hosts (the directory was previously empty)
[11:38:16] Logged the message, Master
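The diagnosis and fix above compress into a few commands. A rough sketch, assuming the rsync source and flags (the log only says the tree was synced from fenari by hand; the process name, paths and restart command are quoted verbatim in the conversation):

```bash
# Is the job loop alive, and does the multiversion dir have MWScript.php?
ps axu | grep '[j]obs-loop.sh'
ls -l /usr/local/apache/common/multiversion/MWScript.php

# Re-populate the empty tree from fenari, then restart the runner
# (rsync invocation assumed; the restart command is the one j^ names above)
rsync -a fenari:/usr/local/apache/common-local/ /usr/local/apache/common-local/
/etc/init.d/mw-job-runner restart
```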
[11:49:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:58:11] PROBLEM - Frontend Squid HTTP on amssq54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:59:39] RECOVERY - Frontend Squid HTTP on amssq54 is OK: HTTP OK HTTP/1.0 200 OK - 795 bytes in 0.243 seconds
[12:01:07] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.476 seconds
[12:08:30] New patchset: Mark Bergsma; "Increase the frontend cache size on servers with a lot of memory" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31819
[12:09:07] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31819
[12:18:28] New patchset: Mark Bergsma; "Apparently we don't have to_bytes() available" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31821
[12:18:45] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31821
[12:35:12] PROBLEM - Puppet freshness on erzurumi is CRITICAL: Puppet has not run in the last 10 hours
[12:36:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:38:13] !log Performance testing cp3003
[12:38:18] Logged the message, Master
[12:45:15] New patchset: Hydriz; "(bug 41757) Enable special:import on Hindi Wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31823
[12:46:14] PROBLEM - Frontend Squid HTTP on amssq57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:49:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.416 seconds
[12:51:02] RECOVERY - Frontend Squid HTTP on amssq57 is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 1.179 seconds
[12:56:12] PROBLEM - Frontend Squid HTTP on amssq57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:59:20] RECOVERY - Frontend Squid HTTP on amssq57 is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 0.235 seconds
[13:22:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:37:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.031 seconds
[13:43:20] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:44:53] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.162 second response time
[13:57:21] New review: Dereckson; "shellpolicy" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/31823
[13:58:11] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:58:11] PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:59:41] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.959 second response time
[13:59:41] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.781 second response time
[14:02:23] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:23] PROBLEM - Apache HTTP on mw27 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:57] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time
[14:03:57] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.177 second response time
[14:04:47] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:06:26] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.567 second response time
[14:10:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:23:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.274 seconds
[14:28:31] New review: Andrew Bogott; "I merged https://gerrit.wikimedia.org/r/#/c/31252/ so it should be possible to refactor this as a me..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/30593
[14:29:35] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:31:11] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.804 second response time
[14:31:12] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:32:41] RECOVERY - Apache HTTP on mw38 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.052 second response time
[14:46:47] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[14:58:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:02:32] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:04:02] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.728 second response time
[15:09:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.002 seconds
[15:34:33] PROBLEM - Apache HTTP on mw27 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:35:40] RECOVERY - Host sq67 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[15:36:03] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.262 second response time
[15:39:25] PROBLEM - Host sq67 is DOWN: PING CRITICAL - Packet loss = 100%
[15:42:34] RECOVERY - Host sq67 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[15:44:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:49:28] PROBLEM - Host sq67 is DOWN: PING CRITICAL - Packet loss = 100%
[15:55:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.930 seconds
[15:59:30] RECOVERY - Varnish HTTP bits on sq67 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.005 seconds
[15:59:31] RECOVERY - Host sq67 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[16:05:14] New patchset: Reedy; "enwiki to 1.21wmf3" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31840
[16:12:14] New patchset: Pyoungmeister; "re-adding my contact to nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31843
[16:13:25] glhf notpeter
[16:13:37] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours
[16:13:37] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours
[16:13:37] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[16:14:59] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31843
[16:16:28] PROBLEM - Host sq68 is DOWN: PING CRITICAL - Packet loss = 100%
[16:17:33] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31670
[16:17:49] !log Reinstalling sq68 with Precise
[16:17:58] Logged the message, Master
[16:22:25] RECOVERY - Host sq68 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[16:25:42] PROBLEM - Varnish HTTP bits on sq68 is CRITICAL: Connection refused
[16:31:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:42:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.045 seconds
[16:46:03] PROBLEM - NTP on sq68 is CRITICAL: NTP CRITICAL: No response from NTP server
[16:49:03] PROBLEM - Host sq68 is DOWN: PING CRITICAL - Packet loss = 100%
[16:51:41] RECOVERY - Host sq68 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[16:52:03] RECOVERY - Varnish HTTP bits on sq68 is OK: HTTP OK HTTP/1.1 200 OK - 630 bytes in 0.002 seconds
[16:57:05] !log Reinstalling sq69
[16:57:10] Logged the message, Master
[17:03:56] PROBLEM - Varnish HTTP bits on sq69 is CRITICAL: Connection refused
[17:04:39] PROBLEM - SSH on sq69 is CRITICAL: Connection refused
[17:07:57] RECOVERY - SSH on sq69 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[17:13:13] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 492344 seconds
[17:13:30] PROBLEM - Host sq69 is DOWN: PING CRITICAL - Packet loss = 100%
[17:18:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:19:40] !log DNS update - push wikivoyage.de (link)
[17:19:51] Logged the message, Master
[17:20:23] !log gracefulling all apaches to pick up https://gerrit.wikimedia.org/r/#/c/31670/
[17:20:28] Logged the message, notpeter
[17:20:58] py is doing a graceful restart of all apaches
[17:21:12] !log py gracefulled all apaches
[17:21:16] Logged the message, Master
[17:23:19] paravoid: As a result of that long email thread a month ago, I have in mind that I should set up a swift dev box to work on error handling and/or container sync. Do you think it actually makes sense for me to do that? I've sort of lost track of what our plan is for swift.
[17:31:20] andrewbogott: i have a swift/mediawiki in one virtualbox script at https://github.com/bit/mediawiki_vm its not exactly the setup as in production but helped me to debug/test swift related issues
[17:31:43] j^: Cool, that'd be a good place to start. Presuming I want to start :)
[17:33:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.033 seconds
[17:33:46] !log reedy synchronized php-1.21wmf3/extensions/MwEmbedSupport
[17:33:46] Logged the message, Master
[17:35:10] !log reedy synchronized php-1.21wmf3/extensions/TimedMediaHandler/
[17:35:16] Logged the message, Master
[17:42:42] andrewbogott: no, I don't think we need anything there anymore
[17:43:08] paravoid: How so? Are we abandoning swift?
[17:43:40] no, there's no such plan yet
[17:44:21] but container sync does not apply well to our problem and the swift people are preparing a different feature that may suit us better
[17:44:46] they're already doing that work, I'm not sure if it makes sense to help them coding that
[17:44:51] Ah, are they adding a feature to do actual multi-data-center replication?
[17:45:03] yes
[17:45:07] http://swiftstack.com/blog/2012/09/16/globally-distributed-openstack-swift-cluster/
[17:45:28] OK, that seems much better :)
[17:45:52] so, yeah, they're already in the process of writing the code for that
[17:45:52] How are we going to manage the datacenter migration in the meantime? Or do we hope to have that feature soon?
[17:46:16] well
[17:46:24] we do hope that we'll have that soon
[17:46:30] and we have the netapps in place to actually have a copy for DR reasons
[17:46:49] whether we'll use it as part of the eqiad migration remains to be seen
[17:47:13] we're probably going to leave upload in tampa while we migrate the rest to eqiad, but nothing's set in stone
[17:47:16] OK. I was wondering why we can't just ship duplicate drives and then use container sync to correct the diff that appeared in the meantime...
[17:47:30] no, it's much more complicated than that
[17:47:52] * andrewbogott sort of wants to know and sort of doesn't.
[17:48:07] container sync needs some sync points to be able to sync
[17:48:12] if these aren't set, it's just going to replicate everything again
[17:48:24] that's one of the problems with that approach
[17:48:31] it doesn't work like rsync at all
[17:48:40] (unfortunately)
[17:48:43] Oh, OK. So I guess I'm presuming that container sync isn't stupid :(
[17:48:57] also, the way cont-sync is designed, I don't think it'll be able to keep up with the new files anyway
[17:48:57] it was dog slow
[17:49:22] I also don't trust it enough to run in production with our critical data
[17:49:47] it wasn't very well designed
[17:49:57] the latest bug that I found was that once you set up sync for a container
[17:50:03] you can never ever tear it down again :-)
[17:50:34] I submitted it as a bug and they fixed it -- apparently a two-line fix
[17:50:34] but still, kind of shows its quality
[17:50:42] not by itself, but combined with all the other bugs I mailed about
[17:51:27] Reedy: so... j^ opened a bug for us, about how MW is not getting synced to tmh1/tmh2
[17:51:40] is it just a matter of adding it to the dsh groups?
[17:53:21] Yup, though notpeter has fixed it already :)
[17:53:38] oh did he
[17:53:40] * andrewbogott crosses 'swift' off of to-do list
[17:53:54] crap :)
[17:53:54] sorry for the noise
[17:54:12] heh, it's fine
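For reference, the dsh-group fix being asked about is a one-line edit per host on the deploy server. A sketch assuming the conventional dsh layout and group name, neither of which is spelled out in the log:

```bash
# dsh groups are plain newline-separated host lists (path and group name assumed)
echo "tmh1.pmtpa.wmnet" >> /etc/dsh/group/mediawiki-installation
echo "tmh2.pmtpa.wmnet" >> /etc/dsh/group/mediawiki-installation
dsh -g mediawiki-installation -M -- uptime   # -M prefixes output with the host name
```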
[17:54:27] andrewbogott: well, if you have "spare" time, I guess you could interface with the swiftstack people about the region support
[17:54:34] not sure if it makes much sense from WMF's perspective
[17:54:43] not my call anyway :)
[17:55:32] my opinion would be that having both of us not working on labs would be bad for the project
[17:56:21] !log reedy synchronized php-1.21wmf2/extensions/MwEmbedSupport/
[17:56:21] Logged the message, Master
[17:57:32] !log reedy synchronized php-1.21wmf2/extensions/TimedMediaHandler/
[17:57:37] Logged the message, Master
[17:58:18] !log reedy synchronized php-1.21wmf3/extensions/TimedMediaHandler/
[17:58:19] Logged the message, Master
[18:05:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:14:24] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused
[18:18:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.082 seconds
[18:18:51] New patchset: Aaron Schulz; "Enabled TMH for all wikis except commons." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31863
[18:19:21] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.003 second response time on port 11000
[18:19:48] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31863
[18:22:01] !log aaron synchronized wmf-config/InitialiseSettings.php 'TMH deploy too all except commons.'
[18:22:04] Logged the message, Master
[18:23:41] Awesome, say mutante is it, and then leave :D
[18:24:30] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[18:24:47] New patchset: Pyoungmeister; "Reduce scaler MaxClients to 18" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31865
[18:29:41] <^demon> Reedy: Delegate then flee?
[18:30:12] Yup
[18:35:16] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31865
[18:37:37] how hard would it be to get some queries run on db9 / db10 ?
[18:38:56] <^demon> Depends on the queries :)
[18:39:42] showing the create tables and the row counts :P
[18:39:59] more intimate queries could appear later
[18:40:13] <^demon> On what database?
[18:40:45] otrs one, not sure about the exact database name
[18:40:49] although it's probably 'otrs'
[18:41:09] <^demon> I thought otrs was moved off db9/10.
[18:41:26] maybe, I got that info from wikitech, so...
[18:44:38] sigh, there's a mention in the SAL from 2008
[18:44:43] Srv38 claims to be otrs db server, but that's from 2006
[18:45:35] <^demon> I'm pretty sure it's not srv38.
[18:45:43] <^demon> :)
[18:45:43] <^demon> binasher may know.
[18:46:31] the original information (db9 & db10) turns out to be from 2009 - https://wikitech.wikimedia.org/index.php?title=OTRS&diff=18337&oldid=14622
[18:47:03] <^demon> I could've sworn it was moved. Anyway, binasher would be the guy to ask.
[18:47:03] yeah, I'm pretty sure it got moved
[18:47:05] cause it was too big
[18:47:06] Platonides: db48 is the master, i'll update the wiki page
[18:47:11] and db9 didn't have enough space
[18:47:19] I remember something about moving, but not where
[18:47:43] wiki updated
[18:48:24] so.. can those queries be run?
[18:49:17] what queries? Jeff_Green usually handles otrs support, tho he may be busy with fr
[18:50:06] binasher: meh. that's a fallacy that needs to be slain
[18:50:12] hahah
[18:50:19] I just want to figure out the layout and whether we are actually using those tables or not
[18:50:35] nobody takes care of poor otrs :P
[18:50:42] these queries http://pastebin.com/hp6Kr5CP
[18:51:08] Jeff_Green: it will live on as a zombie fallacy after slaying!
[18:51:30] "Jeff helped one person once with OTRS, now he's the sole supporter for the entire org!"
[18:51:57] Jeff_Green: hey, that's how I became our DBA ;)
[18:51:59] ha
[18:52:02] <^demon> That's what you get for helping people ;-)
[18:52:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:52:46] that looks cut and pasteable tho.. Platonides i can run those for you
[18:53:01] thanks
[18:53:07] * Jeff_Green is gonna add little johnny drop tables...
[18:53:18] I used explain on the selects that looked slow
[18:53:34] you may still want to run it on the slave, though
[18:53:50] which is probably idle most of time
[18:53:50] (ie. just replicating)
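The pastebin contents aren't preserved here, but Platonides describes them as the CREATE TABLE statements and row counts. Queries of that shape would look roughly like this; db48 and the 'otrs' database name are from the conversation, the table name is illustrative:

```bash
# Schema summary plus approximate row counts for every table
mysql -h db48 otrs -e 'SHOW TABLE STATUS\G'
# Full definition of a single table (table name illustrative)
mysql -h db48 otrs -e 'SHOW CREATE TABLE ticket\G'
```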
Jeff_Green usually handles otrs support, tho he may be busy with fr [18:50:06] binasher: meh. that's a fallacy that needs to be slain [18:50:12] hahah [18:50:19] I just want to figure out the layout and whether we are actually using those tables or not [18:50:35] nobody takes care of poor otrs :P [18:50:42] these queries http://pastebin.com/hp6Kr5CP [18:51:08] Jeff_Green: it will live on as a zombie fallacy after slaying! [18:51:30] "Jeff helped one person once with OTRS, now he's the sole supporter for the entire org!" [18:51:57] Jeff_Green: hey, that's how I became our DBA ;) [18:51:59] ha [18:52:02] <^demon> That's what you get for helping people ;-) [18:52:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:52:46] that looks cut and pasteable tho.. Platonides i can run those for you [18:53:01] thanks [18:53:07] * Jeff_Green is gonna add little johnny drop tables... [18:53:18] I used explain on the selects that looked slow [18:53:34] you may still want to run it on the slave, though [18:53:50] which is probably idle most of time [18:53:50] (ie. just replicating) [18:53:59] Platonides: where do you want the output? [18:54:23] do you have a fenari account? [18:55:34] while we're at it, would be lovely to get a tarball of the OTRS code. because no one seems to know exactly what version from CVS the quilt is based off of [18:56:18] binasher, I don't [18:56:34] although I wouldn't complain if you enabled one while you are at it xD [18:57:06] you could use labs bastion instead [18:57:06] Platonides: slave and master are not black and white here... [18:57:07] jeremyb, +1 [18:57:55] jeremyb, I tried to guess from the last svn date of our patches, for extracting the db layout [18:58:04] Platonides: k [18:58:32] Platonides: have you read the upgrade instructions to get to a more recent version? they made me cry [18:58:33] db48 and db49 are missing from [[Server_roles]] btw [18:58:49] wtf is [[Server_roles]] [18:59:01] https://wikitech.wikimedia.org/view/Server_roles [18:59:20] jeremyb, at least there are instructionf for upgrading [18:59:27] Platonides: yeah! [18:59:57] that they were last used by Tim in 2009 is "unrelated" [19:00:07] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 26.5414620833 (gt 8.0) [19:00:11] or that nobody tried to follow them in the last 3 years... [19:01:16] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31840 [19:02:08] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: enwiki to 1.21wmf3 [19:02:18] Logged the message, Master [19:07:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds [19:20:43] AaronSchulz: which projects should we enable pecl-memcached on this afternoon? [19:21:44] binasher: maybe zh -> de -> more [19:21:57] binasher: would you see a reason not to do all of them if it goes well? [19:22:13] or maybe wait a day for that [19:22:29] AaronSchulz: nope, i'm looking forward to doing all of them, and disabling multiwrite [19:22:35] waiting a day might be smart though [19:22:50] yeah, but beyond that I don't see much reason to drag it out [19:23:17] AaronSchulz: the caches array in wgObjectCaches['memcached-multiwrite'] [19:23:33] do reads always happen from slot 0 first? [19:23:36] heya Jeff_Green [19:23:43] is this you? [19:23:49] tail -50000000 bannerImpressions-sampled1.log [19:23:49] on oxygen? 
[19:24:22] I'm trying to extract 10 minutes of 1:1 logs for the fr folks to analyse
[19:24:29] i think its causing packet loss, at least, its the only new thing I know about,
[19:24:38] probably
[19:24:39] i'm not sure what else to do
[19:24:41] can we kill it?
[19:24:43] AaronSchulz: that seems like a bad test
[19:24:48] can you copy the logs elsewhere and analyze?
[19:25:04] Jeff_Green: we like to have all data manipulation stuff happen on stat1
[19:25:20] binasher: yeah, we'd want to swap before doing the huge ramp up
[19:25:33] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: Puppet has not run in the last 10 hours
[19:25:45] it's a >200GB log, i figured the copy would cause packet loss too
[19:25:47] but I want to add a few wikis first, then swap, then do more
[19:26:26] IIRC, syncing files has never caused packet loss so far
[19:26:37] AaronSchulz: i'd want to add a few and swap the order at the same time.. or swap the order now
[19:26:46] drdee: alright. i'll try that
[19:27:12] thx
[19:28:14] binasher: I was thinking of testing if it handles the writes, then writes+reads, but...I could go either way since it probably won't matter
[19:34:34] New patchset: Asher; "returning appserver MaxRequestsPerChild to pre-ppnode limit levels" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31879
[19:35:49] AaronSchulz: as long as pecl is in slot 0 before more than a few additional wikis are added. i don't think the write-only test has much value though
[19:36:48] yep
[19:37:57] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31879
[19:39:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:44:41] binasher, do you have any DB-related concerns regarding https://gerrit.wikimedia.org/r/#/c/27610/ ?
[19:48:46] !log payments flipped to eqiad cluster
[19:48:47] Logged the message, Master
[19:49:23] MaxSem: i don't think so. where is the geo_killlist table defined though, was it already created in an earlier version of the geodata extension?
[19:50:43] MaxSem: testwiki only has geo_tags, are there schema migrations in another changeset?
[19:52:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.680 seconds
[19:54:00] didn't the interwiki links use to work on wikitech? http://wikitech.wikimedia.org/view/Software_deployments
[19:54:40] New review: Dzahn; "redirects for wlm. per multichill" [operations/apache-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/31303
[19:54:40] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/31303
[19:54:48] robla: i don't think so, since wikitech isn't actually a wmf wiki
[19:55:04] robla: its since the recent upgrade
[19:55:25] robla: before we had manually filled the interwiki table in db at some point to make them work
[19:55:36] any plans on fixing that?
[19:55:36] and now it does not use the table anymore
[19:55:36] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[19:55:36] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[19:55:36] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[19:55:36] binasher, the killlist is in https://gerrit.wikimedia.org/r/#/c/27610/6/sql/externally-backed.sql - since testwiki has very few geo tags and is the only production installation, I think it would be easier just to recreate the tables and edit the template that populates the DB
[19:56:14] robla: briefly talked to Roan already, need to find out how..
[19:57:01] MaxSem: oh hah, i misread that as being removed, not renamed
[19:57:03] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is 0.0
[19:57:30] mutante: it's maintenance/interwiki.sql
[19:59:57] robla: i have been told we dont use the interwiki table anymore while in docs i see it still talks about it in 1.9
[19:59:59] looking in maintenance
[20:00:02] <^demon> We don't use the interwiki table at wmf.
[20:00:14] <^demon> (Default installs for 3rd parties do use it)
[20:00:17] Does wikitech localsettings mention an interwikicache cdb?
[20:00:30] ^demon: we still have the table in db
[20:00:30] I'm guessing not..
[20:00:30] MaxSem: no concerns with the db usage there, i'll +1 and we can merge after it gets a regular code review from nikerabbit
[20:00:37] binasher, thanks!
[20:01:04] wikitech is probably a lot more like a vanilla MW install
[20:01:04] <^demon> mutante: For awhile, we did need both because the API when returning a list of interwiki prefixes couldn't read the full list from the CDB file.
[20:01:20] Reedy: that sounds like it, no such setting in LocalSettings
[20:01:20] <^demon> Nowadays on wmf sites, we don't need the table.
[20:01:23] <^demon> I don't know about wikitech, but it's likely to just use the table.
[20:01:26] New review: Multichill; "If redirect to the toolserver so they don't get blackholed this is fine with me." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/31302
[20:01:39] ^demon: it used to use the table before it was upgraded
[20:03:15] i don't seem to have a .cdb file
[20:03:15] yes, it did
[20:03:48] I used to know how to do this (as in I did it once before for some wiki or other)
[20:04:05] something tells it to not use the table
[20:04:21] but there is no setting to point to a cdb either
[20:04:38] I think modern versions just don't use the table
[20:04:38] iirc
[20:04:48] this is what i keep hearing :) yeah
[20:05:29] but .. http://www.mediawiki.org/wiki/Interwiki_cache
[20:05:46] MediaWiki does not contain a script to build or update such a CDB cache file (bug 33395), ..
[20:05:46] Platonides: pls poke me if you get the dump
[20:05:46] that's really old
[20:05:51] "WikimediaMaintenance contains dumpInterwiki.php and rebuildInterwiki.php which are custom Wikimedia-specific scripts used for the CDB cache of all Wikimedia wikis."...
[20:06:02] apergos: surprise :)
[20:06:18] the part about the tables I mean
[20:09:21] hahaha you need /home/wikipedia/common/langlist and some other crap
[20:09:39] to run dumpInterwiki.php
[20:09:41] figures
[20:09:54] can i not just tell it to use the mysql table again? hrrr
[20:09:54] Steal a copy from the wikimedia cluster
[20:10:00] of a .cdb ?
[20:10:01] That's what I've done before for local usage
[20:10:03] Yeah
[20:10:04] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 61.0388428333 (gt 8.0)
[20:10:11] ok
[20:10:17] /home/wikipedia/common/php/cache/interwiki.cdb
[20:10:35] hello there around :-]
[20:11:01] any ops with LDAP knowledge around? We have some users with GID 550 (svn) and others with GID 500 (wikidev) which mess things on the beta cluster :-(
[20:11:35] I'm pretty sure I also stole a copy (but I don't know if I had to massage it later)
[20:11:50] and no you can't tell it to use the table again
[20:13:31] it's pretty great cause newer created wikis don't have anything in the table but old ones have a bunch of crap in there
[20:13:44] (on our production cluster)
[20:14:06] I think I might have cleared those tables..
[20:14:07] at some point
[20:14:08] maybe
[20:14:17] * apergos goes to check
[20:14:41] ok, $wgInterwikiCache="./cache/interwiki.cdb";
[20:14:58] nope, still have content
[20:15:06] lol
[20:17:01] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:17:35] it wont use the .cdb yet..sigh
[20:17:44] $wgInterwikiCache="$IP/cache/interwiki.cdb";
[20:17:44] Hmmm, Jeff_Green
[20:17:51] oxygen is still angry
[20:17:56] how new is the Temporarily catch all logs for banner impressions filter?
[20:18:07] oh pretty new, right?
[20:18:07] friday?
[20:18:10] a month?
[20:18:12] eerrrghhhh
[20:18:13] oh
[20:18:24] you just changed it on friday? (i'm just looking at git blame)
[20:18:24] i added a couple new strings last week
[20:18:41] $wgInterwikiCache = "$IP/cache/interwiki.cdb";
[20:18:41] that's what I had
[20:18:41] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.377 second response time
[20:18:48] that's from a local 119 install
[20:18:57] apergos: looks identical then ..hrm
[20:18:59] so, Jeff_Green!
[20:19:03] ottomata: the fetch has finished
[20:19:06] did you know ( i did not until a couple of weeks ago)
[20:19:14] that there is a multicast udp2log stream?
[20:19:18] meaning you can consume udp2logs from anywhere?
[20:19:30] i had heard rumors of this
[20:19:35] its really easy to set up, you can just run your own udp2log instance, just like it runs on oxygen
[20:19:47] i should probably make a generic puppetization of that
[20:19:53] is the cdb file in the right place and readable by the web server etc?
[20:19:55] mutante:
[20:20:04] did the fetch *just* finish?
[20:20:13] i just got the alert about 5 mins ago about packet loss
[20:20:36] apergos: yes, owned by www-data actually, and in ./cache/
[20:20:52] Jeff_Green ^
[20:20:56] ottomata: the "socat" process is a little suspicious
[20:21:08] that is actually the multicast stream :)
[20:21:14] i suspicioued on that one once before, ha, and killed it
[20:21:18] and then all udp2logs stopped
[20:21:25] that's how I found out about the stream
[20:21:33] so on oxygen right now
[20:21:46] ottomata: not sure, within the past 15 minutes?
[20:21:49] socat listens to the udp stream from the squids+, and then forwards to multicast addy
[20:21:59] then udp2log instance on oxygen subscribes to multicast addy
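A minimal sketch of the relay ottomata describes; the port and multicast group below are placeholders, not the production values:

```bash
# On the relay: take the unicast log datagrams from the squids and
# re-publish them to a multicast group, so any number of consumers
# can subscribe (addresses and port assumed)
socat -u UDP4-RECV:8420 UDP4-DATAGRAM:233.0.0.1:8420

# On a consumer: join the group, which is effectively what the udp2log
# instance on oxygen does, and peek at the stream
socat -u UDP4-RECV:8420,ip-add-membership=233.0.0.1:eth0 - | head
```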
[20:22:24] binasher: fluorine logs are f*cked up
[20:23:05] just do "ls -l" ;)
[20:23:05] $wgInterwikiScopes This setting specifies the number of domains to check for messages:
[20:23:06] ?
[20:23:58] ok, i'm going to give it a few more minutes before I start hunting, hopefully it was only your copy that was making it angry
[20:23:58] maybe but I never set that
[20:27:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:28:41] ottomata: I wouldn't be surprised
[20:31:08] New patchset: Hashar; "Simple renaming misc::beta::{scripts,autoupdater}" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31886
[20:31:55] patch-interwiki.sql is in ./maintenance/archives (!)
[20:32:18] did you just fix wikitech?
[20:32:45] cause it now seems to be ok
[20:33:02] does it?
[20:33:04] yes, you don't want the sql for anything
[20:33:09] so it's in archives
[20:33:18] not for me in preview ...
[20:33:23] well I just am previewing a link on my page
[20:33:34] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is -0.163826470588
[20:33:51] m: something and it points me to meta
[20:34:10] wth
[20:34:27] Jeff_Green, packet loss back to normal, so phew :)
[20:35:48] New patchset: Hashar; "role::beta::autoupdater" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31888
[20:36:07] can I get a review + merge of https://gerrit.wikimedia.org/r/31886 and https://gerrit.wikimedia.org/r/31888 please ? :-]
[20:36:37] apergos: [[wiktionary:foo]] works, but usually we have wikt: dont we
[20:36:54] yes there are shortcuts
[20:37:28] and they are missing
[20:40:12] !log removing sq67 from bits cache pool for upgrade to precise
[20:40:16] Logged the message, notpeter
[20:40:36] apergos: i dont think they are in that .cdb .. i installed "freecdb" to use cdbdump..and looking
[20:40:54] heh I am trying to get at the contents with the maintenance script and failing...
[20:40:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.018 seconds
[20:41:05] apergos: cdbdump < interwiki.cdb | less
[20:41:37] package freecdb. brb
[20:42:24] !log re-adding sq67 to bits cache pool w/o upgrade and removing arsenic from eqiad bits cache pool for actual upgrade to precise
[20:42:30] Logged the message, notpeter
[20:50:29] PROBLEM - Host arsenic is DOWN: PING CRITICAL - Packet loss = 100%
[20:51:06] * jeremyb wonders who owns arsenic
[20:51:17] ah, it's not peter
[20:51:17] see above
[20:51:36] * jeremyb sees
[20:55:49] RECOVERY - Host arsenic is UP: PING OK - Packet loss = 0%, RTA = 26.71 ms
[21:03:14] hm who do you want to kill
[21:03:39] New patchset: Asher; "adding de+zhwiki to pecl-memcached test and swapping multiwrite order" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31890
[21:03:41] mutante: "shortcuts" are defined in a different way because they depend on the wiki language
[21:04:03] so that w: links to en.wiki from en.*, to fr.wiki from fr.* etc.
[21:04:24] and multilingual link to en
[21:04:48] New patchset: Asher; "adding de+zhwiki to pecl-memcached test and swapping multiwrite order" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31890
[21:05:03] so the problem probably lies there, but those interwikis are actually bad as brion said somewhere, so maybe no need to bother
[21:05:23] Nemo_bis: aha, gotcha, any idea what i need for them ? since we already point to the interwiki.cdb
[21:07:17] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[21:07:17] http://meta.wikimedia.org/wiki/Help:Interwiki_linking#Project_titles_and_shortcuts even though it might be surprising that mw: works when it is also listed as a shortcut
[21:07:32] Names.php
[21:07:47] but that ought to just work out of the box, unlessssssss
[21:07:54] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31890
[21:08:02] who can put packages into http://apt.wikimedia.org/?
[21:08:53] apergos: we have that file, i see all the languages in there, but not shortcuts yet
[21:09:07] $coreLanguageNames = array(
[21:09:17] !log aaron synchronized wmf-config/mc.php 'Added (zh|de)wiki to memc multiwrite and switched read order.'
[21:09:24] Logged the message, Master
[21:10:03] AaronSchulz: memcached.log verbosity might be too high
[21:10:37] would be nice when someone cleans up mw-log/ dir too btw :)
[21:10:54] whoa, wtf
[21:11:09] I was complaining about that earlier ;)
[21:12:00] New patchset: Jgreen; "disable banner 1:1 logging on oxygen, not needed atm" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31903
[21:12:50] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31903
[21:14:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:15:12] * AaronSchulz wonders what's with 10.0.11.28
[21:16:21] j^: oh? you're upstream for ffmpeg2theora?
[21:16:35] paravoid: yes
[21:16:46] cool!
[21:19:09] !log copied interwiki.cdb from prod to wikitech, used $wgInterwikiCache to point to .cdb, mw does not use mysql table anymore. long iw links work again, "shortcuts" are different though and still an issue
[21:19:13] Logged the message, Master
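Condensed, the fix just logged amounts to this; the source path is the one Reedy quoted earlier, and the ownership step mirrors mutante's note that the file is owned by www-data (exact copy command assumed):

```bash
# Steal the production interwiki cache, as suggested above, and make it
# readable by the web server
scp fenari:/home/wikipedia/common/php/cache/interwiki.cdb ./cache/
chown www-data:www-data cache/interwiki.cdb

# LocalSettings.php then points at the file and the interwiki table is ignored:
#   $wgInterwikiCache = "$IP/cache/interwiki.cdb";
```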
[21:19:31] PROBLEM - Host arsenic is DOWN: PING CRITICAL - Packet loss = 100%
[21:19:39] the answer to your question is "all ops", doing it now
[21:19:54] !log rebooted mw28
[21:19:56] Logged the message, Master
[21:20:27] hrm
[21:21:56] AaronSchulz: i cleaned up mw-log
[21:22:45] mutante: can you also close/update the bug?
[21:22:52] i did
[21:22:58] New patchset: Eloquence; "Add Wikivoyage logo (old logo until new one has been decided)." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31941
[21:23:05] AaronSchulz: do you see any issues?
[21:23:14] I see a few more bogus files, but it looks mostly clean
[21:23:19] j^: uploaded, upgraded on tmh1/2, killall ffmpeg2theora ran
[21:23:33] mutante: oh sorry, thanks
[21:23:44] binasher: 10.0.11.49
[21:23:47] j^: and it'd be nice if ffmpeg/libav actually stopped/crashed when reaching their memory limits instead of looping forever
[21:24:07] AaronSchulz: what about it?
[21:24:25] still spamming the logs
[21:25:07] ?
[21:25:22] RECOVERY - Host arsenic is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms
[21:25:30] you are looking at memcached-serious.log right?
[21:25:49] no, memcached.log
[21:26:27] memcached-serious.log only contains log data on the old client
[21:26:56] mw49 spamming it doesn't actually have anything to do with the pecl ext
[21:27:01] PROBLEM - Apache HTTP on mw28 is CRITICAL: Connection refused
[21:27:03] the pecl mc uses -serious too
[21:27:25] and yet there isn't a log line in it related to pecl
[21:27:40] grep -v 11000 memcached-serious.log
[21:27:57] yes, no errors
[21:28:33] I'm just pointing out mc stuff since we are looking at it, though yeah, it's the old boxes as I can see from the IP
[21:29:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.036 seconds
[21:29:33] mutante: 53896.log and #61440.log are also bogus
[21:29:43] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: Connection refused
[21:29:55] paravoid: yes not sure how to fix that but will do some reading on it
[21:30:13] arg
[21:30:23] PROBLEM - SSH on arsenic is CRITICAL: Connection refused
[21:30:23] RECOVERY - Apache HTTP on mw28 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.688 second response time
[21:30:53] mutante: notpeter: paravoid: we have a lot of apaches that are on old buggy lucid kernels and have >211 days uptime :(
[21:31:29] that means they are not buggy enough
[21:32:07] yeah, if they were buggy enough to cause a reboot, we could ignore it
[21:32:07] binasher: I should disable memcached.log when we finish in fact
[21:32:37] as used, it only seems practical for debugging or dev installs
[21:33:23] New patchset: Eloquence; "Remove BreadCrumbs, add GeoCrumbs. BreadCrumbs was erroneously imported. It is not to be deployed anywhere. GeoCrumbs is used by Wikivoyage for hierarchical navigation hints." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31947
[21:34:21] New patchset: Eloquence; "Remove BreadCrumbs, add GeoCrumbs." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31947
[21:34:27] mutante: is the script you wrote to determine which hosts in a hostgroup need kernel upgrades still around?
[21:34:52] notpeter: if you're upgrading arsenic, you're probably hitting problems with the bonded links
[21:35:08] apparently that broke in precise
[21:35:41] blerg
[21:35:41] ok
[21:35:42] mark: do you know what's up with the mc hosts in precise? did new cables and/or optics actually get ordered for the last few hosts?
[21:36:00] binasher: not sure
[21:36:00] mark: what did you do to fix it?
[21:36:00] PROBLEM - Host mw49 is DOWN: PING CRITICAL - Packet loss = 100%
[21:36:07] RobH: ping
[21:36:12] sup
[21:36:23] !log reboot mw49 for kernel upgrade
[21:36:23] Logged the message, Master
[21:36:24] notpeter: manually, adding "bond-master bond0" to all the slave stanzas in /etc/network/interfaces, see e.g. sq67
[21:36:31] still need to automate that in puppet
[21:36:32] the intel SFPs, i requested an updated quote today for just a couple
[21:36:37] ok, cool
[21:36:38] that's why sq69 is down, and sq70 isn't upgraded yet
[21:36:42] so we can order them and test them before i commit to another big possibly wrong order
[21:36:43] binasher: yeah not a single problem with the new ones, want to do more?
[21:36:47] notpeter: it also breaks the first puppet run
[21:36:58] how so?
[21:37:07] binasher: i assume mark was pinging me about your question
[21:37:07] =]
[21:37:11] because it now dies during the first puppet run when it converts to a LAG
[21:37:21] where it would just work without issues before
[21:37:30] ah, gotcha
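What mark describes is one extra line per slave NIC in /etc/network/interfaces. A minimal sketch; the interface names and the bonding mode are assumptions, since the log only gives the bond-master line itself:

```
# /etc/network/interfaces -- on precise, slave stanzas need an explicit master
auto eth0
iface eth0 inet manual
    bond-master bond0    # the line added by hand on sq67

auto eth1
iface eth1 inet manual
    bond-master bond0

auto bond0
iface bond0 inet static
    # host addressing omitted
    bond-slaves eth0 eth1
    bond-mode 802.3ad    # assumed; "converts to a LAG" above suggests LACP
```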
[21:37:07] binasher: i assume mark was pinging me about your question [21:37:07] =] [21:37:11] because it now dies during the first puppet run when it converts to a LAG [21:37:21] where it would just work without issues before [21:37:30] ah, gotcha [21:37:40] binasher: So by tomorrow I hope to have an order in, with overnight shpping, so hopefully by end of this week we can try them with unpatched percise [21:37:42] and see if it works [21:37:51] RobH: i thought we already figured out what combo works in pmtpa? [21:38:06] it works, but not without you guys doing shit to make it work [21:38:21] I thought the point was to get a solution that requires no extra work when we change kernels [21:38:24] ah, not needing to do that would be good [21:38:27] * AaronSchulz likes how half of the udp logs are in IS and the other in CS [21:38:38] note that our R620s in esams have no issues with unsupported SFPs [21:38:39] yep, thats the plan, if the test we get in this week works [21:38:40] did you hear that Reedy? [21:38:41] they use broadcoms, not intels [21:38:52] then I will just order the intel sfps for the ones we have that have the issue [21:38:53] RobH: that sounds great, let me know when the order comes in [21:39:00] RECOVERY - Host mw49 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [21:39:05] you bet, cuz im poking you or notpeter to test ;] [21:39:07] so for future Rx20 hosts it should be ok [21:39:13] mark: thats good news [21:39:27] RobH: I can help with that [21:40:22] cool, CT had asked me about this last week and i have a followup call with him later [21:40:27] but its on my list and in progress [21:44:50] AaronSchulz: lets.. how many more should we add? [21:44:55] New patchset: Aaron Schulz; "Use memcached-multwrite for all wikis." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31948 [21:45:07] binasher: never mind that ;) [21:45:33] actually that's probably ok [21:46:32] AaronSchulz: yeah, i'm ok with that. first, let me do an apache-graceful-all to make live a maxrequestperchild change i pushed out earlier [21:47:04] binasher: also, fyi, we now also have the parts onsite to wire it the way it is in tampa if we need it done before the week is out. [21:47:08] for memcached. [21:48:22] chris had spares in tampa and brought them up here [21:48:45] AaronSchulz: What? [21:48:53] RobH: ok, cool.. we still have time to try to get the sfp compatibility issue solved tho [21:49:00] asher is doing a graceful restart of all apaches [21:49:03] Reedy: I was just hinting at some possible code cleanup [21:49:07] awesome, just wanted you to know all the options [21:49:14] !log asher gracefulled all apaches [21:49:16] Logged the message, Master [21:49:19] move all the udp log code to InitialiseSettings [21:49:43] PROBLEM - NTP on arsenic is CRITICAL: NTP CRITICAL: No response from NTP server [21:51:26] binasher: is it done? [21:51:41] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31948 [21:51:43] guess so [21:51:45] AaronSchulz: yah, deploy away [21:52:05] should we take bets on a site outage? [21:52:50] well that would only happen with bad timeouts ;) [21:52:57] * AaronSchulz could bet a $3 beer nothing would explode [21:54:11] !log aaron synchronized wmf-config/InitialiseSettings.php 'Disabled spammy (with pecl memcached) memcached.log, the useful one is memcached-serious.' 
[21:54:45] AaronSchulz: i would be happy to buy you a $3 beer
[21:54:50] !log aaron synchronized wmf-config/mc.php 'Switched all wikis to memcached-multiwrite.'
[21:54:55] binasher: come to Europe :-]
[21:54:56] Logged the message, Master
[21:55:10] AaronSchulz: lol, do you see the "ITEM TOO BIG" errors?
[21:55:26] let me ssh back in to tail
[21:55:28] that must have always been a prob.. but wtf, are those keys really >1mb?
[21:55:34] 2012-11-05 21:54:49 mw34 frwikisource: Memcached error for key "commonswiki:file:bf829888c5580d1c337c817973ff39a7" on server "10.0.12.4:11211": ITEM TOO BIG
[21:55:47] New patchset: Catrope; "Add VisualEditor namespace creation to wmf-config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31949
[21:55:56] yeah I see, a moderate amount of spam
[21:56:13] it's a few keys.. can we write the contents of one to a file?
[21:56:17] somehow
[21:56:25] binasher: I bet it is big file metadata
[21:56:40] file repo caches that with the file DB info, and it tends to get huge in some cases
[21:56:40] file metadata > 1mb sounds crazy
[21:57:32] binasher: see LocalFile::isCacheable() ;)
[21:57:54] that check is just for the in-process cache though
[21:59:16] still, the only real errors are for the old ones
[21:59:46] AaronSchulz: CACHE_FIELD_MAX_LEN should be larger in this case
[22:00:12] but still.. damn
[22:00:13] binasher: it might avoid repeat fetch/sets
[22:00:14] that's a lot of metadata
[22:00:15] yeah
[22:00:34] though then the RepoGroup::MAX_CACHE_SIZE value would need to drop
[22:00:42] or it's OOM time
[22:00:43] ok, off to a meeting
[22:00:59] ok, this seems like a fine place to leave it for now
[22:01:00] binasher: looks like there is lots of time to schedule the old cache removal from multiwrite tomorrow
[22:01:12] yup
[22:01:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:01:57] and while port 11211 is monitored by nagios on the individual mc servers, there's a script that checks all of them that needs updating
[22:02:13] AaronSchulz: i'd like to test persistent connections too, now that maxreqsperchild is back up
[22:10:43] !log rebooting mw21,25,40 for kernel upgrades
[22:10:45] Logged the message, Master
[22:12:00] PROBLEM - Apache HTTP on mw21 is CRITICAL: Connection refused
[22:14:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds
[22:17:16] PROBLEM - Apache HTTP on mw40 is CRITICAL: Connection refused
[22:20:33] mark: did you have a problem with late_command not running on the esams bits caches when you reinstalled them?
[22:20:33] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.059 second response time
[22:21:47] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31947
[22:21:51] notpeter: how do you figure that?
[22:21:54] RECOVERY - Apache HTTP on mw21 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.051 second response time
[22:22:47] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31941
[22:23:55] !log rebooting srv191,225,230 for kernel upgrades
[22:24:02] Logged the message, Master
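The "ITEM TOO BIG" errors above are the client refusing values over memcached's default 1 MB item size. A minimal sketch of the kind of size guard being discussed; the wrapper and constant are illustrative, not MediaWiki's actual LocalFile / CACHE_FIELD_MAX_LEN code:

    <?php
    // Sketch: memcached rejects items over its size limit (1 MB by
    // default, unless memcached is started with a larger -I value).
    define( 'MC_MAX_ITEM_BYTES', 1024 * 1024 );

    function cacheSetGuarded( Memcached $mc, $key, $value, $ttl ) {
        // Approximate the stored size; the client serializes for us.
        if ( strlen( serialize( $value ) ) >= MC_MAX_ITEM_BYTES ) {
            return false; // too big: let callers recompute instead
        }
        return $mc->set( $key, $value, $ttl );
    }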
[22:24:18] binasher: are there more like those?
[22:24:21] sorry, just saw the backlog
[22:25:17] what kernel version were they running?
[22:25:22] paravoid: i just finished upgrading all that are up for > 191 days, but the bulk of the lucid app servers are at 191 days
[22:25:32] I fired up servermon and I have kernel versions for all of our machines
[22:26:03] paravoid: i think notpeter can upgrade them all to precise within that time window
[22:26:08] the blocker was memcached on the apaches
[22:26:18] ah
[22:26:18] which will no longer be used as of tomorrow
[22:26:19] yay
[22:26:24] yay²
[22:26:41] have you looked at that performance issue at all though?
[22:26:55] it might bite us hard if we upgrade the whole appserver fleet to precise
[22:27:12] hashar is asking Tim to have a look
[22:28:21] paravoid: I just tried installing arsenic twice, and both times it failed to run late_command, which has important stuff like an ssh key and a directive to install ssh-server
[22:28:32] we are in our weekly call, added that to our agenda.
[22:28:36] hashar: thanks!
[22:28:51] though it seems to be related to kernel / driver / raid setup or something
[22:31:21] PROBLEM - Apache HTTP on srv191 is CRITICAL: Connection refused
[22:31:21] PROBLEM - Apache HTTP on srv230 is CRITICAL: Connection refused
[22:31:40] PROBLEM - Apache HTTP on srv225 is CRITICAL: Connection refused
[22:33:09] PROBLEM - Frontend Squid HTTP on amssq61 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:35:56] paravoid: i don't think that issue has been seen with the app servers which have been upgraded to precise
[22:36:19] I haven't checked
[22:36:19] RECOVERY - Frontend Squid HTTP on amssq61 is OK: HTTP OK HTTP/1.0 200 OK - 795 bytes in 0.239 seconds
[22:36:23] when i looked at it, the wait and disk i/o was all to sqlite files
[22:36:34] which nothing else uses
[22:36:36] PROBLEM - Puppet freshness on erzurumi is CRITICAL: Puppet has not run in the last 10 hours
[22:37:04] I was talking about sqlite's fsync() behavior today too
[22:37:04] binasher: yeah, upgrade-helper
[22:37:19] but hashar told me that he's seen it on non-sqlite tests too
[22:39:24] where are we using sqlite outside of doing tests?
[22:39:27] RECOVERY - Apache HTTP on srv191 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.053 second response time
[22:39:54] Platonides, swift
[22:39:56] (and jenkins should place them on tmpfs or use a patched sqlite...)
[22:40:40] wtf does swift need sqlite?
[22:40:52] yeeaaah
[22:41:14] swift does use sqlite, but that's completely off-topic here
[22:41:39] Reedy: we have the issue with the dbdump tests
[22:41:40] 30 sec run time to 1 minute
[22:41:58] there might be several issues though
[22:42:10] hashar, are they run in tmpfs?
[22:42:16] nope, on disk
[22:42:24] tmpfs would be an idea
[22:42:26] move them to memory
[22:42:54] makes no sense to wait for disk when we don't really care if the box crashes
[22:43:02] it is not going to make the Dump tests faster though
[22:43:03] RECOVERY - Apache HTTP on srv225 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.052 second response time
[22:43:08] paravoid: have you seen performance issues with production apache hosts? i haven't
[22:43:24] we could probably also use an LD_PRELOADed fake fdatasync()
[22:43:38] depending on how it is called by jenkins
[22:44:41] Platonides: anyway I am more worried about having perf issues on the application servers :-D
[22:48:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
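For context on late_command: it is the debian-installer preseed hook that runs at the end of an install. A hypothetical snippet doing the two things notpeter mentions (install ssh-server, drop in a key); this is illustrative, not the actual WMF preseed:

    # Hypothetical d-i preseed line, not the real config:
    d-i preseed/late_command string \
        in-target apt-get install -y openssh-server ; \
        mkdir -p /target/root/.ssh ; \
        echo "ssh-rsa AAAA...example..." > /target/root/.ssh/authorized_keys ; \
        chmod 600 /target/root/.ssh/authorized_keys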
[22:48:56] looking at lucid vs. precise apaches in slow-parse.log
[22:49:21] PROBLEM - check_squid on payments1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name squid
[22:49:27] 95% time for just lucid apaches is 16.39s, for precise apaches 16.24s
[22:50:28] PROBLEM - check_squid on payments1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name squid
[22:51:14] https://www.mediawiki.org/wiki/Wikimedia%20engineering%2020%%20policy
[22:51:17] 400 Bad Request?
[22:51:32] WFM
[22:52:21] Krinkle: Try https://www.mediawiki.org/wiki/Wikimedia%20engineering%2020%25%20policy
[22:52:28] ?
[22:52:28] yeah
[22:52:45] (urlencode('%') == '%25')
[22:52:50] I know how to get to the page, but there's a weird encoding issue recently
[22:53:01] heh 20% is actually in the title, %20 is space
[22:53:06] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds
[22:53:12] Yup
[22:53:12] I think it started since we started using nginx
[22:53:33] https://www.mediawiki.org/wiki/Wikimedia_engineering 20% policy
[22:53:42] put that in your browser url
[22:54:05] nginx complaining
[22:54:19] whee, 2 different errors
[22:54:48] https://www.mediawiki.org/wiki/Wikimedia_engineering_20%25_policy
[22:54:57] http://cl.ly/image/1L2Y2C3v1C0S
[22:54:59] yeah, 25 appearing in there
[22:55:21] PROBLEM - check_squid on payments1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name squid
[22:55:21] PROBLEM - check_squid on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name squid
[22:55:21] PROBLEM - check_squid on payments4 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[22:55:21] PROBLEM - check_squid on payments1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name squid
[22:55:22] PROBLEM - check_squid on payments1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name squid
[22:55:26] PROBLEM - check_squid on payments2 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[22:55:26] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[22:55:26] PROBLEM - check_squid on payments1 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[23:00:27] PROBLEM - check_squid on payments1003 is CRITICAL: PROCS CRITICAL: 1 process with command name squid3
[23:00:27] PROBLEM - check_squid on payments1001 is CRITICAL: PROCS CRITICAL: 1 process with command name squid3
[23:00:27] PROBLEM - check_squid on payments1002 is CRITICAL: PROCS CRITICAL: 1 process with command name squid3
[23:00:27] PROBLEM - check_squid on payments2 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[23:00:27] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[23:00:28] PROBLEM - check_squid on payments4 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[23:00:28] PROBLEM - check_squid on payments1004 is CRITICAL: PROCS CRITICAL: 1 process with command name squid3
[23:00:29] PROBLEM - check_squid on payments1 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[23:01:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.819 seconds
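The encoding confusion above boils down to the literal "%" in the page title needing to be percent-encoded itself. A quick PHP illustration:

    <?php
    // The real page title contains a literal "%":
    $title = 'Wikimedia engineering 20% policy';

    // A space encodes to %20, while "%" itself must become %25;
    // a bare "%" in a URL is an invalid escape sequence.
    echo rawurlencode( '%' ), "\n";    // %25

    // MediaWiki-style: underscores for spaces, then encode the rest.
    echo rawurlencode( str_replace( ' ', '_', $title ) ), "\n";
    // Wikimedia_engineering_20%25_policy  (the URL that works above)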
[23:02:56] hashar: so I can see the problem on gallium by just running the tests from the command line?
[23:03:18] TimStarling: yup, by running the Dump test suite. There is a copy in my homedir
[23:03:31] /home/hashar/mwcore/mediawiki-78a5729
[23:03:57] then php tests/phpunit/phpunit.php --group Dump
[23:04:01] I wrote some random thoughts on https://wikitech.wikimedia.org/view/User:Hashar/bug41607
[23:04:03] RECOVERY - Apache HTTP on srv230 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.049 second response time
[23:05:27] PROBLEM - check_squid on payments1004 is CRITICAL: PROCS CRITICAL: 1 process with command name squid3
[23:05:27] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[23:05:27] PROBLEM - check_squid on payments4 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[23:05:27] PROBLEM - check_squid on payments1002 is CRITICAL: PROCS CRITICAL: 1 process with command name squid3
[23:05:27] PROBLEM - check_squid on payments1003 is CRITICAL: PROCS CRITICAL: 1 process with command name squid3
[23:05:27] PROBLEM - check_squid on payments1001 is CRITICAL: PROCS CRITICAL: 1 process with command name squid3
[23:05:27] PROBLEM - check_squid on payments2 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[23:05:27] PROBLEM - check_squid on payments1 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[23:07:11] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31886
[23:07:26] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31888
[23:10:13] ooh, debug packages, fancy
[23:10:28] PROBLEM - check_squid on payments1004 is CRITICAL: PROCS CRITICAL: 1 process with command name squid3
[23:10:28] PROBLEM - check_squid on payments1003 is CRITICAL: PROCS CRITICAL: 1 process with command name squid3
[23:10:28] PROBLEM - check_squid on payments1001 is CRITICAL: PROCS CRITICAL: 1 process with command name squid3
[23:10:28] PROBLEM - check_squid on payments1002 is CRITICAL: PROCS CRITICAL: 1 process with command name squid3
[23:10:28] PROBLEM - check_squid on payments4 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[23:10:28] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[23:10:28] PROBLEM - check_squid on payments2 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[23:10:28] PROBLEM - check_squid on payments1 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[23:11:03] mutante: still some cruft .log files
[23:12:20] TimStarling: and I just found your doc https://wikitech.wikimedia.org/view/GDB_with_PHP
[23:14:27] yeah, that's basically what I'm doing, getting currently running function names
[23:15:31] RECOVERY - check_squid on payments1 is OK: PROCS OK: 1 process with command name squid
[23:15:31] RECOVERY - check_squid on payments1003 is OK: PROCS OK: 1 process with command name squid3
[23:15:31] RECOVERY - check_squid on payments1001 is OK: PROCS OK: 1 process with command name squid3
[23:15:31] RECOVERY - check_squid on payments1004 is OK: PROCS OK: 1 process with command name squid3
[23:15:31] RECOVERY - check_squid on payments1002 is OK: PROCS OK: 1 process with command name squid3
[23:15:32] RECOVERY - check_squid on payments4 is OK: PROCS OK: 1 process with command name squid
[23:15:32] RECOVERY - check_squid on payments3 is OK: PROCS OK: 1 process with command name squid
[23:15:33] RECOVERY - check_squid on payments2 is OK: PROCS OK: 1 process with command name squid
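A rough sketch of the GDB-with-PHP technique referenced above, assuming the php5 debug symbols (the "debug packages") are installed; the paths and process match are illustrative, and zbacktrace is the macro from the .gdbinit shipped in the PHP source tree:

    # Attach to the stuck test runner and dump the PHP-level stack:
    gdb -p "$(pgrep -f 'phpunit.php --group Dump' | head -n1)"
    (gdb) source /path/to/php-src/.gdbinit    # illustrative path
    (gdb) zbacktrace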
[23:22:06] New patchset: Asher; "remove -n option from php" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31968
[23:23:48] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31968
[23:27:02] New patchset: Reedy; "Few bits of wikivoyage config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31971
[23:27:12] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31971
[23:27:45] !log reedy synchronized wmf-config/InitialiseSettings.php
[23:27:56] Logged the message, Master
[23:29:45] New patchset: Reedy; "Add enwikivoyage to dblists" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31973
[23:29:51] it's mostly I/O by the looks of it
[23:30:00] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31973
[23:30:23] there's about 4 seconds of system time due to the "declare(ticks=1);" in /usr/share/php/PHP/Invoker.php
[23:30:33] i.e. when I comment it out, that system time goes away
[23:30:58] that's the origin of the rt_sigprocmask() calls that you have probably seen in strace
[23:31:59] New patchset: Reedy; "Disable including of replacetext" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31974
[23:31:59] then there's around 20s of user time, and an additional 30-50 seconds of wall clock time that seems to be mostly fsync() calls while constructing SQLite databases
[23:32:20] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31974
[23:32:22] ;)
[23:32:31] so that is probably the low-hanging fruit
[23:34:10] !log reedy synchronized wmf-config/
[23:34:16] Logged the message, Master
[23:34:54] open("/home/hashar/mwcore/mediawiki-78a5729/my_wiki.sqlite-journal", O_RDWR|O_CREAT, 0664) = 9
[23:35:00] hmm
[23:35:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:35:04] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files:
[23:35:09] Logged the message, Master
[23:35:47] when it's called from the web it's at /var/lib/jenkins/jobs/MediaWiki-Tests-Parser/workspace/data/my_wiki.sqlite-journal
[23:36:19] or MediaWiki-Tests-Dumps I guess
[23:36:50] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: enwikivoyage to 1.21wmf3
[23:36:55] Logged the message, Master
[23:37:25] TimStarling: the sqlite db is deleted before each job IIRC
[23:38:20] TimStarling: I believe the declare( ticks = 1 ) is to find out when the test has reached some timeout
[23:38:41] I'm trying to run du on /var/lib/jenkins/jobs, it's been running for 2 minutes now
[23:39:03] maybe the new sqlite version has some perf regression regarding the way they write the journal
[23:39:09] moving it to a tmpfs would be an obvious solution, I'll let you know whether it is feasible when the du finishes
[23:39:23] oh, du is huge
[23:39:34] we have logs of all builds since Jenkins was set up
[23:39:39] there's always eatmydata
[23:40:03] (apt-cache show eatmydata)
[23:41:26] but /var/lib/jenkins/jobs/MediaWiki-Tests-Dumps/workspace is only 312KB
[23:41:49] the filesystem is ext3
[23:42:04] can you configure the workspace to be somewhere else?
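A hedged sketch of what the declare(ticks=1) in PHP/Invoker.php is there for, per the comments above: ticks make the engine check for pending signals after nearly every statement, so a SIGALRM handler can interrupt a runaway test promptly, and that per-statement bookkeeping is where the rt_sigprocmask() system time goes. This shows the general technique, not Invoker's exact code; it needs the pcntl extension on the CLI:

    <?php
    declare( ticks = 1 ); // check for signals after (almost) every statement

    pcntl_signal( SIGALRM, function () {
        throw new RuntimeException( 'test timed out' );
    } );
    pcntl_alarm( 2 ); // deliver SIGALRM in 2 seconds

    $n = 0;
    try {
        while ( true ) {
            $n++; // stand-in for a slow test body
        }
    } catch ( RuntimeException $e ) {
        echo $e->getMessage(), "\n"; // "test timed out"
    }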
[23:42:09] ext3 is known to sync the whole journal when you fsync an fd
[23:42:34] well, gallium has just one huge FS
[23:43:04] !log reedy synchronized php-1.21wmf3/extensions/UserMerge
[23:43:07] Logged the message, Master
[23:43:20] /dev/md0 on / type ext3 (rw,errors=remount-ro) 452G
[23:43:29] you could potentially also use "PRAGMA journal_mode = OFF;"
[23:43:55] that's an SQLite query that I guess would have to be run on each file you create
[23:44:05] yeah
[23:44:30] iirc sqlite has multiple modes of syncing
[23:44:36] depending on your constraints
[23:44:51] there was a bug a while back, when firefox switched to sqlite
[23:44:55] so maybe the journal_mode has been enabled by default in Precise
[23:45:06] firefox used the aggressive sync option, because it didn't want to lose data
[23:45:13] ext3 synced the whole journal because that's what it does
[23:45:24] well, there is a pragma called fullfsync
[23:45:38] and firefox wrote to sqlite every time you clicked a link or typed something into the "awesomebar"
[23:45:46] but the docs say it is off by default and there doesn't seem to be a configure entry to turn it on
[23:45:56] so every time you had firefox open the whole system was dead slow
[23:45:58] configure option I mean
[23:46:55] eatmydata is a nice shotgun approach to ignore all that
[23:48:31] New patchset: Reedy; "Fix casing of addWiki" [operations/mediawiki-multiversion] (master) - https://gerrit.wikimedia.org/r/31977
[23:48:40] "Only Mac OS X supports F_FULLFSYNC"
[23:48:43] Change merged: Reedy; [operations/mediawiki-multiversion] (master) - https://gerrit.wikimedia.org/r/31977
[23:48:47] right, it's not that then
[23:49:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.016 seconds
[23:50:24] anyway, sleep now
[23:50:30] New patchset: Reedy; "Remove old maintenance script locations" [operations/mediawiki-multiversion] (master) - https://gerrit.wikimedia.org/r/31978
[23:50:31] bye
[23:50:48] bye
[23:52:41] TimStarling: also I have seen some latencies when writing CDB files
[23:52:56] !log reedy synchronized multiversion
[23:53:02] Logged the message, Master
[23:53:28] TimStarling: anyway, that is a bit too late for me right now. If you find anything, can you share it on the bug report https://bugzilla.wikimedia.org/show_bug.cgi?id=41607 ? ;)
[23:53:42] ok
[23:53:44] New patchset: Asher; "remove one more "php -n"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31979
[23:54:01] TimStarling: and as usual: thanks for your investigations 8-)
[23:54:26] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31979
[23:54:27] New patchset: Reedy; "enwikivoyage to 1.21wmf3" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31980
[23:54:41] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31980
[23:55:44] New patchset: Kaldari; "Turning Echo on for test2.wiki and mediawiki.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31981
[23:56:48] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31981
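Following the journal_mode discussion above: if one did relax SQLite durability for these throwaway test databases, it would look something like this sketch (assuming PDO's sqlite driver; only sane for data you can afford to lose, like a per-build test wiki):

    <?php
    $db = new PDO( 'sqlite:my_wiki.sqlite' );
    // Skip the rollback journal entirely...
    $db->exec( 'PRAGMA journal_mode = OFF' );
    // ...and stop fsync()ing; on ext3 each fsync() can flush the
    // whole filesystem journal, not just this file's data.
    $db->exec( 'PRAGMA synchronous = OFF' );

The eatmydata route mentioned above gets a similar effect without touching any code: running "eatmydata php tests/phpunit/phpunit.php --group Dump" LD_PRELOADs no-op fsync()/fdatasync() into the process and its children.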