[00:15:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:29:04] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.712 seconds
[01:02:38] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:04:35] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[01:15:50] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.550 seconds
[01:42:14] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 275 seconds
[01:45:37] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 17 seconds
[01:52:17] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:00:35] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 283 seconds
[02:05:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.019 seconds
[02:28:01] !log LocalisationUpdate completed (1.21wmf3) at Mon Nov 5 02:28:01 UTC 2012
[02:28:09] Logged the message, Master
[02:38:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:39:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds
[02:51:13] !log LocalisationUpdate completed (1.21wmf2) at Mon Nov 5 02:51:13 UTC 2012
[02:51:22] Logged the message, Master
[03:59:07] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 6 seconds
[04:45:19] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[06:12:12] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours
[06:12:12] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[06:12:12] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours
[06:31:17] New review: jan; "You are right: Labsconsole does not include low-level-classes but it includes high-level-classes lik..." [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/29975
[06:44:56] New patchset: Jalexander; "switching wikidatawiki to noticeproject:wikimedia so that it doesn't get Wikipedia banners (like editor survey). Temp until new Wikidata category being added to extension soon" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31812
[07:23:21] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused
[07:44:54] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.004 second response time on port 11000
[07:55:08] New review: Hydriz; "Look good (and a nice idea too)." [operations/mediawiki-config] (master) C: 1; - https://gerrit.wikimedia.org/r/31812
[07:56:37] hello
[07:57:13] hi hashar
[07:57:56] oh man, operations/mediawiki-config has a ton of pending changes :-D
[07:58:04] guess I will deploy a bunch of them this morning
[07:58:39] thanks :)
[07:59:02] hashar: If you have time (and you can), do look at https://gerrit.wikimedia.org/r/#/q/status:open+project:mediawiki/extensions/WikimediaMaintenance,n,z
[07:59:02] I sent a patch to do some unit testing
[07:59:17] but including commonsettings.php several times does not play well :(
[08:03:14] New patchset: Hydriz; "(bug 41774) New wgImportSource for ro.wikibooks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31813
[08:04:48] and gerrit is giving 503 :(
[08:15:21] New patchset: Hydriz; "(bug 41774) New wgImportSource for ro.wikibooks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31813
[08:22:12] New patchset: Hashar; "(bug 40717) Namespace configuration for th.wiktionary and th.wikibooks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26449
[08:23:18] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[08:23:30] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/26449
[08:24:31] !log hashar synchronized wmf-config/InitialiseSettings.php '(bug 40717) Namespace configuration for th.wiktionary and th.wikibooks'
[08:24:39] Logged the message, Master
[08:25:21] !log hashar synchronized wmf-config/InitialiseSettings.php '(bug 40717) Namespace configuration for th.wiktionary and th.wikibooks'
[08:25:26] Logged the message, Master
[08:38:52] New patchset: Hashar; "(bug 41167) Namespace configuration for ba.wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28505
[08:39:20] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/28505
[08:40:06] !log hashar synchronized wmf-config/InitialiseSettings.php '(bug 41167) Namespace configuration for ba.wikipedia'
[08:40:12] Logged the message, Master
[08:45:21] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31232
[08:46:20] !log hashar synchronized wmf-config/InitialiseSettings.php '(bug 41467) Create new namespace alias WP: for Norwegian (bokmål) Wikipedia'
[08:51:22] New patchset: Hashar; "(bug 32411) Transwiki import to multilingual wikisource broken" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31562
[08:52:35] New review: Hashar; "We could probably enable any sourcewiki. Anyway, lets try that change." [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/31562
[08:52:35] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31562
[08:53:12] !log hashar synchronized wmf-config/InitialiseSettings.php '(bug 32411) Transwiki import to multilingual wikisource broken'
[08:53:20] Logged the message, Master
[08:56:03] New patchset: Hashar; "(bug 40212) Mass update Wiktionary favicons" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31570
[08:56:59] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31570
[08:57:33] !log hashar synchronized wmf-config/InitialiseSettings.php '(bug 40212) Mass update Wiktionary favicons'
[08:57:41] Logged the message, Master
[08:59:27] New review: Hashar; "I would prefer we stop doing those "I fix whitespaces" changes. That is a lot of overhead just for a..." [operations/mediawiki-config] (master); V: 0 C: 0; - https://gerrit.wikimedia.org/r/31205
[08:59:30] New patchset: Hashar; "Space attack, reduce. See I3aa4e3a3" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31205
[09:00:07] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31205
[09:00:43] !log hashar synchronized wmf-config/InitialiseSettings.php 'whitespace fix {{gerrit|31205}}'
[09:00:50] Logged the message, Master
[09:01:19] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31608
[09:01:59] !log hashar synchronized wmf-config/InitialiseSettings.php 'Configure Babel category for Wikidata {{gerrit|31608}}'
[09:02:05] Logged the message, Master
[09:02:37] New review: Hashar; "Deployed on live cluster." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31608
[09:07:43] New patchset: Hashar; "(bug 40212) Mass update Wiktionary favicons" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31584
[09:13:16] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31584
[09:14:33] New review: Hashar; "deployed live." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31584
[09:15:38] !log hashar synchronized wmf-config/InitialiseSettings.php '(bug 40212) Mass update Wiktionary favicons'
[09:15:40] Logged the message, Master
[09:16:36] New patchset: Hashar; "(bug 38134) Enable Extension:GoogleNewsSitemap on es wikinews" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30589
[09:17:09] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/30589
[09:17:52] !log hashar synchronized wmf-config/InitialiseSettings.php '(bug 38134) Enable Extension:GoogleNewsSitemap on es wikinews'
[09:17:57] Logged the message, Master
[09:21:33] Hydriz: you rock :-]
[09:21:40] ?
[09:21:54] Hydriz: Enable Extension:GoogleNewsSitemap on es wikinews - https://bugzilla.wikimedia.org/38134
[09:21:57] Hydriz: isn't that you ?
[09:22:08] yep :)
[09:22:11] but what about it?
[09:22:31] well you did a lot of changes
[09:22:38] much appreciated :-]
[09:22:42] lolz
[09:22:46] I love how the community is kind of self healing the cluster ;-]
[09:22:58] * Hydriz calms his heart down after the jump
[09:23:06] :P
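The wmf-config changes hashar was merging above (namespace configuration, import sources, favicons) are per-wiki overrides in InitialiseSettings.php, keyed by database name. A hypothetical sketch of the shape such a namespace stanza takes; the wiki key, IDs and names are invented for illustration and are not the actual bug 40717 change:

```php
// Hypothetical per-wiki namespace stanza in wmf-config/InitialiseSettings.php.
// Keys are database names; custom namespaces conventionally start at ID 100,
// with the matching talk namespace at the next odd ID.
'wgExtraNamespaces' => array(
	'examplewiktionary' => array(
		100 => 'Appendix',
		101 => 'Appendix_talk',
	),
),
```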
[09:24:26] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: Puppet has not run in the last 10 hours
[09:25:32] New patchset: Hashar; "(bug 41774) New wgImportSource for ro.wikibooks" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31813
[09:25:56] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31813
[09:26:22] !log hashar synchronized wmf-config/InitialiseSettings.php '(bug 41774) New wgImportSource for ro.wikibooks'
[09:26:25] merged in an hour, thanks :)
[09:26:30] Logged the message, Master
[09:28:12] Hydriz: monday is my 20% day
[09:28:26] Hydriz: and I usually spend the morning merging and deploying mediawiki-config changes
[09:28:26] o_O thats good
[09:35:02] New patchset: Hashar; "(bug 41717) Update default (language-neutral) Wiktionary logo" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31585
[09:35:44] New review: Hashar; "rebased and fixed conflict." [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/31585
[09:35:44] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31585
[09:36:15] !log hashar synchronized wmf-config/InitialiseSettings.php '(bug 41717) Update default (language-neutral) Wiktionary logo'
[09:36:20] Logged the message, Master
[09:40:08] now this may be language neutral, but it's kinda funny: https://be.wiktionary.org/wiki/%D0%9F%D0%B5%D1%80%D1%88%D0%B0%D1%8F_%D1%81%D1%82%D0%B0%D1%80%D0%BE%D0%BD%D0%BA%D0%B0
[09:43:03] New review: Hashar; "Some minor questions." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/31580
[09:44:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[09:45:03] New review: Hashar; "Commented on bug 41712 as well." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/31580
[09:45:57] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.235 seconds
[09:48:34] New patchset: Hashar; "wikidatawiki to use noticeproject::wikimedia banner" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31812
[09:48:56] New review: Hashar; "I have rephrased the commit message and rebased the changed." [operations/mediawiki-config] (master); V: 0 C: 2; - https://gerrit.wikimedia.org/r/31812
[09:48:56] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31812
[09:49:54] !log hashar synchronized wmf-config/InitialiseSettings.php 'wikidatawiki to use noticeproject::wikimedia banner'
[09:49:58] Logged the message, Master
[09:54:40] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[09:54:40] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[09:54:40] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[09:57:23] off for a few minutes
[10:09:18] New review: Nikerabbit; "It would take less time if bad whitespace were not merged in the first place thus requiring followup..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31205
[10:20:28] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:25:27] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.804 seconds
[10:26:25] paravoid: hi, could you log into tmh1 and pastebin or email me ps axu, am trying to figure out why no jobs are processed
[10:28:21] ps axu | grep apache might be enough to see if jobs-loop.sh is running and what state it is in
[10:37:24] j^:
[10:37:25] apache 7194 0.0 0.0 13440 5380 ? SN Oct23 2:29 /bin/bash /usr/local/bin/jobs-loop.sh -t 14400 -v 0 webVideoTranscode
[10:37:25] apache 23345 0.0 0.0 4308 356 ? SN 10:37 0:00 sleep 5
[10:37:29] only two lines of apache
[10:38:52] /usr/local/apache/common/multiversion is empty
[10:39:36] there should be MWScript.php and other things in it
[10:40:05] this is why jobs-loop.sh fails; it can never get a next db since it doesn't find MWScript.php
[10:47:49] apergos: do you know what puppet class creates /usr/local/apache/common/multiversion
[10:48:13] I believe there is a special sync script that populates it
[10:48:36] can you run it?
[10:48:48] let me see if I can find it
[10:49:58] ohh I can sync-dir, but can I do that for one host only? mm
[10:50:06] nope.
[10:50:13] * apergos does it the hard way
[10:53:09] oh heh
[10:53:16] /usr/local/apache/common is empty :-D
[10:54:34] on labs /usr/local/apache/common -> /usr/local/apache/common-local
[10:54:56] while all of /usr/local/apache is a symlink somewhere else (/data/project/apache)
[10:55:20] not sure how far this should/is the same in production
[10:55:37] tmh1/2 should have the same layout as jobrunners have
[10:55:48] yes, well the point is
[10:55:50] lrwxrwxrwx 1 root root 12 Oct 3 01:03 common -> common-local
[10:56:01] but this dir is empty instead of having nice yummy mw installations in it
[10:56:04] I'm doing the rsync now
[10:56:45] the jobrunners are like any other apache box, same setup
[10:57:04] (keeps it simpler to have one generic setup everywhere)
[10:58:04] if tmh1 is not in the mw installation hosts for dsh you will want to add it, and also to the job runners group. (I wonder if those are in puppet yet.)
[10:58:11] sounds good, just wondering why it was missing on tmh1/2 and how to make sure updates also get synced
[10:58:30] ah do I need to do tmh2 also?
[10:58:55] they have the same setup, so if tmh1 is broken i assume tmh2 is broken too
[10:59:00] tmh2 is even worse off
[10:59:20] * j^ looks how to add tmh1/2 to mw installation hosts for dsh
[11:00:37] * apergos adds the symlink for common-local to tmh2 and does the same rsync here
[11:00:37] *there
[11:01:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:02:47] with all files in place /etc/init.d/mw-job-runner restart is needed, cd happens before the loop
[11:03:03] yes indeed
[11:03:08] well I am waiting for the rsyncs to finish
[11:03:47] after that I'll try getting a next db from the command line and running a job
[11:03:52] if that works then I'll restart the script
[11:04:15] not able to find what needs to be done to have this happen automatically next time, should i file a bug in rt about it?
[11:04:43] (these take a while because they copy over all the branches of mw that we have on fenari which is about 4 right now)
[11:04:57] sure, if you file it that will highlight the problem we have keeping those files up to date
[11:05:41] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[11:09:14] ok rt ticket at https://rt.wikimedia.org/Ticket/Display.html?id=3861
[11:09:26] great, thnk you
[11:09:28] *thank
[11:12:06] I see transcode jobs running now
[11:12:10] tmh1
[11:12:47] same on tmh2
[11:13:02] I guess you are good to go for a little while
[11:13:09] nice, thanks a lot
[11:13:20] yw
[11:14:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.393 seconds
[11:35:01] PROBLEM - Frontend Squid HTTP on amssq54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:37:52] ah guess I should log all that
[11:38:10] RECOVERY - Frontend Squid HTTP on amssq54 is OK: HTTP OK HTTP/1.0 200 OK - 795 bytes in 0.235 seconds
[11:38:11] !log synced apache/common-local by hand to tmh1 and tmh2 and restart job runners on both hosts (the directory was previously empty)
[11:38:16] Logged the message, Master
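The diagnosis and fix above compress into a few commands. A rough sketch, assuming the rsync source and flags (the log only says the tree was synced from fenari by hand; the process name, paths and restart command are quoted verbatim in the conversation):

```bash
# Is the job loop alive, and does the multiversion dir have MWScript.php?
ps axu | grep '[j]obs-loop.sh'
ls -l /usr/local/apache/common/multiversion/MWScript.php

# Re-populate the empty tree from fenari, then restart the runner
# (rsync invocation assumed; the restart command is the one j^ names above)
rsync -a fenari:/usr/local/apache/common-local/ /usr/local/apache/common-local/
/etc/init.d/mw-job-runner restart
```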
[11:49:47] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:58:11] PROBLEM - Frontend Squid HTTP on amssq54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:59:39] RECOVERY - Frontend Squid HTTP on amssq54 is OK: HTTP OK HTTP/1.0 200 OK - 795 bytes in 0.243 seconds
[12:01:07] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.476 seconds
[12:08:30] New patchset: Mark Bergsma; "Increase the frontend cache size on servers with a lot of memory" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31819
[12:09:07] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31819
[12:18:28] New patchset: Mark Bergsma; "Apparently we don't have to_bytes() available" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31821
[12:18:45] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31821
[12:35:12] PROBLEM - Puppet freshness on erzurumi is CRITICAL: Puppet has not run in the last 10 hours
[12:36:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:38:13] !log Performance testing cp3003
[12:38:18] Logged the message, Master
[12:45:15] New patchset: Hydriz; "(bug 41757) Enable special:import on Hindi Wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31823
[12:46:14] PROBLEM - Frontend Squid HTTP on amssq57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:49:14] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.416 seconds
[12:51:02] RECOVERY - Frontend Squid HTTP on amssq57 is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 1.179 seconds
[12:56:12] PROBLEM - Frontend Squid HTTP on amssq57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:59:20] RECOVERY - Frontend Squid HTTP on amssq57 is OK: HTTP OK HTTP/1.0 200 OK - 656 bytes in 0.235 seconds
[13:22:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:37:19] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.031 seconds
[13:43:20] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:44:53] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.162 second response time
[13:57:21] New review: Dereckson; "shellpolicy" [operations/mediawiki-config] (master) C: -1; - https://gerrit.wikimedia.org/r/31823
[13:58:11] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:58:11] PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:59:41] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.959 second response time
[13:59:41] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.781 second response time
[14:02:23] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:02:23] PROBLEM - Apache HTTP on mw27 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:57] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time
[14:03:57] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.177 second response time
[14:04:47] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:06:26] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.567 second response time
[14:10:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:23:05] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.274 seconds
[14:28:31] New review: Andrew Bogott; "I merged https://gerrit.wikimedia.org/r/#/c/31252/ so it should be possible to refactor this as a me..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/30593
[14:29:35] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:31:11] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.804 second response time
[14:31:12] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:32:41] RECOVERY - Apache HTTP on mw38 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.052 second response time
[14:46:47] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[14:58:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:02:32] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:04:02] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.728 second response time
[15:09:28] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.002 seconds
[15:34:33] PROBLEM - Apache HTTP on mw27 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:35:40] RECOVERY - Host sq67 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[15:36:03] RECOVERY - Apache HTTP on mw27 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.262 second response time
[15:39:25] PROBLEM - Host sq67 is DOWN: PING CRITICAL - Packet loss = 100%
[15:42:34] RECOVERY - Host sq67 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[15:44:22] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:49:28] PROBLEM - Host sq67 is DOWN: PING CRITICAL - Packet loss = 100%
[15:55:55] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.930 seconds
[15:59:30] RECOVERY - Varnish HTTP bits on sq67 is OK: HTTP OK HTTP/1.1 200 OK - 632 bytes in 0.005 seconds
[15:59:31] RECOVERY - Host sq67 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[16:05:14] New patchset: Reedy; "enwiki to 1.21wmf3" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31840
[16:12:14] New patchset: Pyoungmeister; "re-adding my contact to nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31843
[16:13:25] glhf notpeter
[16:13:37] PROBLEM - Puppet freshness on db42 is CRITICAL: Puppet has not run in the last 10 hours
[16:13:37] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours
[16:13:37] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[16:14:59] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31843
[16:16:28] PROBLEM - Host sq68 is DOWN: PING CRITICAL - Packet loss = 100%
[16:17:33] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31670
[16:17:49] !log Reinstalling sq68 with Precise
[16:17:58] Logged the message, Master
[16:22:25] RECOVERY - Host sq68 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms
[16:25:42] PROBLEM - Varnish HTTP bits on sq68 is CRITICAL: Connection refused
[16:31:06] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:42:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.045 seconds
[16:46:03] PROBLEM - NTP on sq68 is CRITICAL: NTP CRITICAL: No response from NTP server
[16:49:03] PROBLEM - Host sq68 is DOWN: PING CRITICAL - Packet loss = 100%
[16:51:41] RECOVERY - Host sq68 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[16:52:03] RECOVERY - Varnish HTTP bits on sq68 is OK: HTTP OK HTTP/1.1 200 OK - 630 bytes in 0.002 seconds
[16:57:05] !log Reinstalling sq69
[16:57:10] Logged the message, Master
[17:03:56] PROBLEM - Varnish HTTP bits on sq69 is CRITICAL: Connection refused
[17:04:39] PROBLEM - SSH on sq69 is CRITICAL: Connection refused
[17:07:57] RECOVERY - SSH on sq69 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[17:13:13] PROBLEM - MySQL Slave Delay on db78 is CRITICAL: CRIT replication delay 492344 seconds
[17:13:30] PROBLEM - Host sq69 is DOWN: PING CRITICAL - Packet loss = 100%
[17:18:12] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:19:40] !log DNS update - push wikivoyage.de (link)
[17:19:51] Logged the message, Master
[17:20:23] !log gracefulling all apaches to pick up https://gerrit.wikimedia.org/r/#/c/31670/
[17:20:28] Logged the message, notpeter
[17:20:58] py is doing a graceful restart of all apaches
[17:21:12] !log py gracefulled all apaches
[17:21:16] Logged the message, Master
[17:23:19] paravoid: As a result of that long email thread a month ago, I have in mind that I should set up a swift dev box to work on error handling and/or container sync. Do you think it actually makes sense for me to do that? I've sort of lost track of what our plan is for swift.
[17:31:20] andrewbogott: i have a swift/mediawiki in one virtualbox script at https://github.com/bit/mediawiki_vm its not exactly the setup as in production but helped me to debug/test swift related issues
[17:31:43] j^: Cool, that'd be a good place to start. Presuming I want to start :)
[17:33:00] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.033 seconds
[17:33:46] !log reedy synchronized php-1.21wmf3/extensions/MwEmbedSupport
[17:33:46] Logged the message, Master
[17:35:10] !log reedy synchronized php-1.21wmf3/extensions/TimedMediaHandler/
[17:35:16] Logged the message, Master
[17:42:42] andrewbogott: no, I don't think we need anything there anymore
[17:43:08] paravoid: How so? Are we abandoning swift?
[17:43:40] no, there's no such plan yet
[17:44:21] but container sync does not apply well to our problem and the swift people are preparing a different feature that may suit us better
[17:44:46] they're already doing that work, I'm not sure if it makes sense to help them coding that
[17:44:51] Ah, are they adding a feature to do actual multi-data-center replication?
[17:45:03] yes
[17:45:07] http://swiftstack.com/blog/2012/09/16/globally-distributed-openstack-swift-cluster/
[17:45:28] OK, that seems much better :)
[17:45:52] so, yeah, they're already in the process of writing the code for that
[17:45:52] How are we going to manage the datacenter migration in the meantime? Or do we hope to have that feature soon?
[17:46:16] well
[17:46:24] we do hope that we'll have that soon
[17:46:30] and we have the netapps in place to actually have a copy for DR reasons
[17:46:49] whether we'll use it as part of the eqiad migration remains to be seen
[17:47:13] we're probably going to leave upload in tampa while we migrate the rest to eqiad, but nothing's set in stone
[17:47:16] OK. I was wondering why we can't just ship duplicate drives and then use container sync to correct the diff that appeared in the meantime...
[17:47:30] no, it's much more complicated than that
[17:47:52] * andrewbogott sort of wants to know and sort of doesn't.
[17:48:07] container sync needs some sync points to be able to sync
[17:48:12] if these aren't set, it's just going to replicate everything again
[17:48:24] that's one of the problems with that approach
[17:48:31] it doesn't work like rsync at all
[17:48:40] (unfortunately)
[17:48:43] Oh, OK. So I guess I'm presuming that container sync isn't stupid :(
[17:48:57] also, the way cont-sync is designed, I don't think it'll be able to keep up with the new files anyway
[17:48:57] it was dog slow
[17:49:22] I also don't trust it enough to run in production with our critical data
[17:49:47] it wasn't very well designed
[17:49:57] the latest bug that I found was that once you set up sync for a container
[17:50:03] you can never ever tear it down again :-)
[17:50:34] I submitted it as a bug and they fixed it -- apparently a two-line fix
[17:50:34] but still, kind of shows its quality
[17:50:42] not by itself, but combined with all the other bugs I mailed about
[17:51:27] Reedy: so... j^ opened a bug for us, about how MW is not getting synced to tmh1/tmh2
[17:51:40] is it just a matter of adding it to the dsh groups?
[17:53:21] Yup, though notpeter has fixed it already :)
[17:53:38] oh did he
[17:53:40] * andrewbogott crosses 'swift' off of to-do list
[17:53:54] crap :)
[17:53:54] sorry for the noise
[17:54:12] heh, it's fine
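For reference, the dsh-group fix being asked about is a one-line edit per host on the deploy server. A sketch assuming the conventional dsh layout and group name, neither of which is spelled out in the log:

```bash
# dsh groups are plain newline-separated host lists (path and group name assumed)
echo "tmh1.pmtpa.wmnet" >> /etc/dsh/group/mediawiki-installation
echo "tmh2.pmtpa.wmnet" >> /etc/dsh/group/mediawiki-installation
dsh -g mediawiki-installation -M -- uptime   # -M prefixes output with the host name
```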
[17:54:27] andrewbogott: well, if you have "spare" time, I guess you could interface with the swiftstack people about the region support
[17:54:34] not sure if it makes much sense from WMF's perspective
[17:54:43] not my call anyway :)
[17:55:32] my opinion would be that having both of us not working on labs would be bad for the project
[17:56:21] !log reedy synchronized php-1.21wmf2/extensions/MwEmbedSupport/
[17:56:21] Logged the message, Master
[17:57:32] !log reedy synchronized php-1.21wmf2/extensions/TimedMediaHandler/
[17:57:37] Logged the message, Master
[17:58:18] !log reedy synchronized php-1.21wmf3/extensions/TimedMediaHandler/
[17:58:19] Logged the message, Master
[18:05:07] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:14:24] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused
[18:18:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.082 seconds
[18:18:51] New patchset: Aaron Schulz; "Enabled TMH for all wikis except commons." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31863
[18:19:21] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.003 second response time on port 11000
[18:19:48] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31863
[18:22:01] !log aaron synchronized wmf-config/InitialiseSettings.php 'TMH deploy too all except commons.'
[18:22:04] Logged the message, Master
[18:23:41] Awesome, say mutante is it, and then leave :D
[18:24:30] PROBLEM - Puppet freshness on db62 is CRITICAL: Puppet has not run in the last 10 hours
[18:24:47] New patchset: Pyoungmeister; "Reduce scaler MaxClients to 18" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31865
[18:29:41] <^demon> Reedy: Delegate then flee?
[18:30:12] Yup
[18:35:16] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31865
[18:37:37] how hard would it be to get some queries run on db9 / db10 ?
[18:38:56] <^demon> Depends on the queries :)
[18:39:42] showing the create tables and the row counts :P
[18:39:59] more intimate queries could appear later
[18:40:13] <^demon> On what database?
[18:40:45] otrs one, not sure about the exact database name
[18:40:49] although it's probably 'otrs'
[18:41:09] <^demon> I thought otrs was moved off db9/10.
[18:41:26] maybe, I got that info from wikitech, so...
[18:44:38] sigh, there's a mention in the SAL from 2008
[18:44:43] Srv38 claims to be otrs db server, but that's from 2006
[18:45:35] <^demon> I'm pretty sure it's not srv38.
[18:45:43] <^demon> :)
[18:45:43] <^demon> binasher may know.
[18:46:31] the original information (db9 & db10) turns out to be from 2009 - https://wikitech.wikimedia.org/index.php?title=OTRS&diff=18337&oldid=14622
[18:47:03] <^demon> I could've sworn it was moved. Anyway, binasher would be the guy to ask.
[18:47:03] yeah, I'm pretty sure it got moved
[18:47:05] cause it was too big
[18:47:06] Platonides: db48 is the master, i'll update the wiki page
[18:47:11] and db9 didn't have enough space
[18:47:19] I remember something about moving, but not where
[18:47:43] wiki updated
[18:48:24] so.. can those queries be run?
[18:49:17] what queries? Jeff_Green usually handles otrs support, tho he may be busy with fr
[18:50:06] binasher: meh. that's a fallacy that needs to be slain
[18:50:12] hahah
[18:50:19] I just want to figure out the layout and whether we are actually using those tables or not
[18:50:35] nobody takes care of poor otrs :P
[18:50:42] these queries http://pastebin.com/hp6Kr5CP
[18:51:08] Jeff_Green: it will live on as a zombie fallacy after slaying!
[18:51:30] "Jeff helped one person once with OTRS, now he's the sole supporter for the entire org!"
[18:51:57] Jeff_Green: hey, that's how I became our DBA ;)
[18:51:59] ha
[18:52:02] <^demon> That's what you get for helping people ;-)
[18:52:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:52:46] that looks cut and pasteable tho.. Platonides i can run those for you
[18:53:01] thanks
[18:53:07] * Jeff_Green is gonna add little johnny drop tables...
[18:53:18] I used explain on the selects that looked slow
[18:53:34] you may still want to run it on the slave, though
[18:53:50] which is probably idle most of time
[18:53:50] (ie. just replicating)
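The pastebin contents aren't preserved here, but Platonides describes them as the CREATE TABLE statements and row counts. Queries of that shape would look roughly like this; db48 and the 'otrs' database name are from the conversation, the table name is illustrative:

```bash
# Schema summary plus approximate row counts for every table
mysql -h db48 otrs -e 'SHOW TABLE STATUS\G'
# Full definition of a single table (table name illustrative)
mysql -h db48 otrs -e 'SHOW CREATE TABLE ticket\G'
```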
Jeff_Green usually handles otrs support, tho he may be busy with fr [18:50:06] binasher: meh. that's a fallacy that needs to be slain [18:50:12] hahah [18:50:19] I just want to figure out the layout and whether we are actually using those tables or not [18:50:35] nobody takes care of poor otrs :P [18:50:42] these queries http://pastebin.com/hp6Kr5CP [18:51:08] Jeff_Green: it will live on as a zombie fallacy after slaying! [18:51:30] "Jeff helped one person once with OTRS, now he's the sole supporter for the entire org!" [18:51:57] Jeff_Green: hey, that's how I became our DBA ;) [18:51:59] ha [18:52:02] <^demon> That's what you get for helping people ;-) [18:52:15] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds [18:52:46] that looks cut and pasteable tho.. Platonides i can run those for you [18:53:01] thanks [18:53:07] * Jeff_Green is gonna add little johnny drop tables... [18:53:18] I used explain on the selects that looked slow [18:53:34] you may still want to run it on the slave, though [18:53:50] which is probably idle most of time [18:53:50] (ie. just replicating) [18:53:59] Platonides: where do you want the output? [18:54:23] do you have a fenari account? [18:55:34] while we're at it, would be lovely to get a tarball of the OTRS code. because no one seems to know exactly what version from CVS the quilt is based off of [18:56:18] binasher, I don't [18:56:34] although I wouldn't complain if you enabled one while you are at it xD [18:57:06] you could use labs bastion instead [18:57:06] Platonides: slave and master are not black and white here... [18:57:07] jeremyb, +1 [18:57:55] jeremyb, I tried to guess from the last svn date of our patches, for extracting the db layout [18:58:04] Platonides: k [18:58:32] Platonides: have you read the upgrade instructions to get to a more recent version? they made me cry [18:58:33] db48 and db49 are missing from [[Server_roles]] btw [18:58:49] wtf is [[Server_roles]] [18:59:01] https://wikitech.wikimedia.org/view/Server_roles [18:59:20] jeremyb, at least there are instructionf for upgrading [18:59:27] Platonides: yeah! [18:59:57] that they were last used by Tim in 2009 is "unrelated" [19:00:07] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 26.5414620833 (gt 8.0) [19:00:11] or that nobody tried to follow them in the last 3 years... [19:01:16] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31840 [19:02:08] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: enwiki to 1.21wmf3 [19:02:18] Logged the message, Master [19:07:15] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.024 seconds [19:20:43] AaronSchulz: which projects should we enable pecl-memcached on this afternoon? [19:21:44] binasher: maybe zh -> de -> more [19:21:57] binasher: would you see a reason not to do all of them if it goes well? [19:22:13] or maybe wait a day for that [19:22:29] AaronSchulz: nope, i'm looking forward to doing all of them, and disabling multiwrite [19:22:35] waiting a day might be smart though [19:22:50] yeah, but beyond that I don't see much reason to drag it out [19:23:17] AaronSchulz: the caches array in wgObjectCaches['memcached-multiwrite'] [19:23:33] do reads always happen from slot 0 first? [19:23:36] heya Jeff_Green [19:23:43] is this you? [19:23:49] tail -50000000 bannerImpressions-sampled1.log [19:23:49] on oxygen? 
[19:24:22] I'm trying to extract 10 minutes of 1:1 logs for the fr folks to analyse
[19:24:29] i think its causing packet loss, at least, its the only new thing I know about,
[19:24:38] probably
[19:24:39] i'm not sure what else to do
[19:24:41] can we kill it?
[19:24:43] AaronSchulz: that seems like a bad test
[19:24:48] can you copy the logs elsewhere and analyze?
[19:25:04] Jeff_Green: we like to have all data manipulation stuff happen on stat1
[19:25:20] binasher: yeah, we'd want to swap before doing the huge ramp up
[19:25:33] PROBLEM - Puppet freshness on ms-fe3 is CRITICAL: Puppet has not run in the last 10 hours
[19:25:45] it's a >200GB log, i figured the copy would cause packet loss too
[19:25:47] but I want to add a few wikis first, then swap, then do more
[19:26:26] IIRC, syncing files has never caused packet loss so far
[19:26:37] AaronSchulz: i'd want to add a few and swap the order at the same time.. or swap the order now
[19:26:46] drdee: alright. i'll try that
[19:27:12] thx
[19:28:14] binasher: I was thinking of testing if it handles the writes, then writes+reads, but...I could go either way since it probably won't matter
[19:34:34] New patchset: Asher; "returning appserver MaxRequestsPerChild to pre-ppnode limit levels" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31879
[19:35:49] AaronSchulz: as long as pecl is in slot 0 before more than a few additional wikis are added. i don't think the write-only test has much value though
[19:36:48] yep
[19:37:57] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31879
[19:39:03] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:44:41] binasher, do you have any DB-related concerns regarding https://gerrit.wikimedia.org/r/#/c/27610/ ?
[19:48:46] !log payments flipped to eqiad cluster
[19:48:47] Logged the message, Master
[19:49:23] MaxSem: i don't think so. where is the geo_killlist table defined though, was it already created in an earlier version of the geodata extension?
[19:50:43] MaxSem: testwiki only has geo_tags, are there schema migrations in another changeset?
[19:52:24] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.680 seconds
[19:54:00] didn't the interwiki links use to work on wikitech? http://wikitech.wikimedia.org/view/Software_deployments
[19:54:40] New review: Dzahn; "redirects for wlm. per multichill" [operations/apache-config] (master); V: 1 C: 2; - https://gerrit.wikimedia.org/r/31303
[19:54:40] Change merged: Dzahn; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/31303
[19:54:48] robla: i don't think so, since wikitech isn't actually a wmf wiki
[19:55:04] robla: its since the recent upgrade
[19:55:25] robla: before we had manually filled the interwiki table in db at some point to make them work
[19:55:36] any plans on fixing that?
[19:55:36] and now it does not use the table anymore
[19:55:36] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[19:55:36] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[19:55:36] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[19:55:36] binasher, the killlist is in https://gerrit.wikimedia.org/r/#/c/27610/6/sql/externally-backed.sql - since testwiki has very few geo tags and is the only production installation, I think it would be easier just to recreate the tables and edit the template that populates the DB
[19:56:14] robla: briefly talked to Roan already, need to find out how..
[19:57:01] MaxSem: oh hah, i misread that as being removed, not renamed
[19:57:03] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is 0.0
[19:57:30] mutante: it's maintenance/interwiki.sql
[19:59:57] robla: i have been told we dont use the interwiki table anymore while in docs i see it still talks about it in 1.9
[19:59:59] looking in maintenance
[20:00:02] <^demon> We don't use the interwiki table at wmf.
[20:00:14] <^demon> (Default installs for 3rd parties do use it)
[20:00:17] Does wikitech localsettings mention an interwikicache cdb?
[20:00:30] ^demon: we still have the table in db
[20:00:30] I'm guessing not..
[20:00:30] MaxSem: no concerns with the db usage there, i'll +1 and we can merge after it gets a regular code review from nikerabbit
[20:00:37] binasher, thanks!
[20:01:04] wikitech is probably a lot more like a vanilla MW install
[20:01:04] <^demon> mutante: For awhile, we did need both because the API when returning a list of interwiki prefixes couldn't read the full list from the CDB file.
[20:01:20] Reedy: that sounds like it, no such setting in LocalSettings
[20:01:20] <^demon> Nowadays on wmf sites, we don't need the table.
[20:01:23] <^demon> I don't know about wikitech, but it's likely to just use the table.
[20:01:26] New review: Multichill; "If redirect to the toolserver so they don't get blackholed this is fine with me." [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/31302
[20:01:39] ^demon: it used to use the table before it was upgraded
[20:03:15] i don't seem to have a .cdb file
[20:03:15] yes, it did
[20:03:48] I used to know how to do this (as in I did it once before for some wiki or other)
[20:04:05] something tells it to not use the table
[20:04:21] but there is no setting to point to a cdb either
[20:04:38] I think modern versions just don't use the table
[20:04:38] iirc
[20:04:48] this is what i keep hearing :) yeah
[20:05:29] but .. http://www.mediawiki.org/wiki/Interwiki_cache
[20:05:46] MediaWiki does not contain a script to build or update such a CDB cache file (bug 33395), ..
[20:05:46] Platonides: pls poke me if you get the dump
[20:05:46] that's really old
[20:05:51] "WikimediaMaintenance contains dumpInterwiki.php and rebuildInterwiki.php which are custom Wikimedia-specific scripts used for the CDB cache of all Wikimedia wikis."...
[20:06:02] apergos: surprise :)
[20:06:18] the part about the tables I mean
[20:09:21] hahaha you need /home/wikipedia/common/langlist and some other crap
[20:09:39] to run dumpInterwiki.php
[20:09:41] figures
[20:09:54] can i not just tell it to use the mysql table again? hrrr
[20:09:54] Steal a copy from the wikimedia cluster
[20:10:00] of a .cdb ?
[20:10:01] That's what I've done before for local usage
[20:10:03] Yeah
[20:10:04] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 61.0388428333 (gt 8.0)
[20:10:11] ok
[20:10:17] /home/wikipedia/common/php/cache/interwiki.cdb
[20:10:35] hello there around :-]
[20:11:01] any ops with LDAP knowledge around? We have some users with GID 550 (svn) and others with GID 500 (wikidev) which mess things on the beta cluster :-(
[20:11:35] I'm pretty sure I also stole a copy (but I don't know if I had to massage it later)
[20:11:50] and no you can't tell it to use the table again
[20:13:31] it's pretty great cause newer created wikis don't have anything in the table but old ones have a bunch of crap in there
[20:13:44] (on our production cluster)
[20:14:06] I think I might have cleared those tables..
[20:14:07] at some point
[20:14:08] maybe
[20:14:17] * apergos goes to check
[20:14:41] ok, $wgInterwikiCache="./cache/interwiki.cdb";
[20:14:58] nope, still have content
[20:15:06] lol
[20:17:01] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:17:35] it wont use the .cdb yet..sigh
[20:17:44] $wgInterwikiCache="$IP/cache/interwiki.cdb";
[20:17:44] Hmmm, Jeff_Green
[20:17:51] oxygen is still angry
[20:17:56] how new is the Temporarily catch all logs for banner impressions filter?
[20:18:07] oh pretty new, right?
[20:18:07] friday?
[20:18:10] a month?
[20:18:12] eerrrghhhh
[20:18:13] oh
[20:18:24] you just changed it on friday? (i'm just looking at git blame)
[20:18:24] i added a couple new strings last week
[20:18:41] $wgInterwikiCache = "$IP/cache/interwiki.cdb";
[20:18:41] that's what I had
[20:18:41] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.377 second response time
[20:18:48] that's from a local 119 install
[20:18:57] apergos: looks identical then ..hrm
[20:18:59] so, Jeff_Green!
[20:19:03] ottomata: the fetch has finished
[20:19:06] did you know ( i did not until a couple of weeks ago)
[20:19:14] that there is a multicast udp2log stream?
[20:19:18] meaning you can consume udp2logs from anywhere?
[20:19:30] i had heard rumors of this
[20:19:35] its really easy to set up, you can just run your own udp2log instance, just like it runs on oxygen
[20:19:47] i should probably make a generic puppetization of that
[20:19:53] is the cdb file in the right place and readable by the web server etc?
[20:19:55] mutante:
[20:20:04] did the fetch *just* finish?
[20:20:13] i just got the alert about 5 mins ago about packet loss
[20:20:36] apergos: yes, owned by www-data actually, and in ./cache/
[20:20:52] Jeff_Green ^
[20:20:56] ottomata: the "socat" process is a little suspicious
[20:21:08] that is actually the multicast stream :)
[20:21:14] i suspicioued on that one once before, ha, and killed it
[20:21:18] and then all udp2logs stopped
[20:21:25] that's how I found out about the stream
[20:21:33] so on oxygen right now
[20:21:46] ottomata: not sure, within the past 15 minutes?
[20:21:49] socat listens to the udp stream from the squids+, and then forwards to multicast addy
[20:21:59] then udp2log instance on oxygen subscribes to multicast addy
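A minimal sketch of the relay ottomata describes; the port and multicast group below are placeholders, not the production values:

```bash
# On the relay: take the unicast log datagrams from the squids and
# re-publish them to a multicast group, so any number of consumers
# can subscribe (addresses and port assumed)
socat -u UDP4-RECV:8420 UDP4-DATAGRAM:233.0.0.1:8420

# On a consumer: join the group, which is effectively what the udp2log
# instance on oxygen does, and peek at the stream
socat -u UDP4-RECV:8420,ip-add-membership=233.0.0.1:eth0 - | head
```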
[20:22:24] binasher: fluorine logs are f*cked up
[20:23:05] just do "ls -l" ;)
[20:23:05] $wgInterwikiScopes This setting specifies the number of domains to check for messages:
[20:23:06] ?
[20:23:58] ok, i'm going to give it a few more minutes before I start hunting, hopefully it was only your copy that was making it angry
[20:23:58] maybe but I never set that
[20:27:44] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:28:41] ottomata: I wouldn't be surprised
[20:31:08] New patchset: Hashar; "Simple renaming misc::beta::{scripts,autoupdater}" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31886
[20:31:55] patch-interwiki.sql is in ./maintenance/archives (!)
[20:32:18] did you just fix wikitech?
[20:32:45] cause it now seems to be ok
[20:33:02] does it?
[20:33:04] yes, you don't want the sql for anything
[20:33:09] so it's in archives
[20:33:18] not for me in preview ...
[20:33:23] well I just am previewing a link on my page
[20:33:34] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is -0.163826470588
[20:33:51] m: something and it points me to meta
[20:34:10] wth
[20:34:27] Jeff_Green, packet loss back to normal, so phew :)
[20:35:48] New patchset: Hashar; "role::beta::autoupdater" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31888
[20:36:07] can I get a review + merge of https://gerrit.wikimedia.org/r/31886 and https://gerrit.wikimedia.org/r/31888 please ? :-]
[20:36:37] apergos: [[wiktionary:foo]] works, but usually we have wikt: dont we
[20:36:54] yes there are shortcuts
[20:37:28] and they are missing
[20:40:12] !log removing sq67 from bits cache pool for upgrade to precise
[20:40:16] Logged the message, notpeter
[20:40:36] apergos: i dont think they are in that .cdb .. i installed "freecdb" to use cdbdump..and looking
[20:40:54] heh I am trying to get at the contents with the maintenance script and failing...
[20:40:59] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.018 seconds
[20:41:05] apergos: cdbdump < interwiki.cdb | less
[20:41:37] package freecdb. brb
[20:42:24] !log re-adding sq67 to bits cache pool w/o upgrade and removing arsenic from eqiad bits cache pool for actual upgrade to precise
[20:42:30] Logged the message, notpeter
[20:50:29] PROBLEM - Host arsenic is DOWN: PING CRITICAL - Packet loss = 100%
[20:51:06] * jeremyb wonders who owns arsenic
[20:51:17] ah, it's not peter
[20:51:17] see above
[20:51:36] * jeremyb sees
[20:55:49] RECOVERY - Host arsenic is UP: PING OK - Packet loss = 0%, RTA = 26.71 ms
[21:03:14] hm who do you want to kill
[21:03:39] New patchset: Asher; "adding de+zhwiki to pecl-memcached test and swapping multiwrite order" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31890
[21:03:41] mutante: "shortcuts" are defined in a different way because they depend on the wiki language
[21:04:03] so that w: links to en.wiki from en.*, to fr.wiki from fr.* etc.
[21:04:24] and multilingual link to en
[21:04:48] New patchset: Asher; "adding de+zhwiki to pecl-memcached test and swapping multiwrite order" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31890
[21:05:03] so the problem probably lies there, but those interwikis are actually bad as brion said somewhere, so maybe no need to bother
[21:05:23] Nemo_bis: aha, gotcha, any idea what i need for them ? since we already point to the interwiki.cdb
[21:07:17] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[21:07:17] http://meta.wikimedia.org/wiki/Help:Interwiki_linking#Project_titles_and_shortcuts even though it might be surprising that mw: works when it is also listed as a shortcut
[21:07:32] Names.php
[21:07:47] but that ought to just work out of the box, unlessssssss
[21:07:54] Change merged: Aaron Schulz; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31890
[21:08:02] who can put packages into http://apt.wikimedia.org/?
[21:08:53] apergos: we have that file, i see all the languages in there, but not shortcuts yet
[21:09:07] $coreLanguageNames = array(
[21:09:17] !log aaron synchronized wmf-config/mc.php 'Added (zh|de)wiki to memc multiwrite and switched read order.'
[21:09:24] Logged the message, Master
[21:10:03] AaronSchulz: memcached.log verbosity might be too high
[21:10:37] would be nice when someone cleans up mw-log/ dir too btw :)
[21:10:54] whoa, wtf
[21:11:09] I was complaining about that earlier ;)
[21:12:00] New patchset: Jgreen; "disable banner 1:1 logging on oxygen, not needed atm" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31903
[21:12:50] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31903
[21:14:16] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:15:12] * AaronSchulz wonders what's with 10.0.11.28
[21:16:21] j^: oh? you're upstream for ffmpeg2theora?
[21:16:35] paravoid: yes
[21:16:46] cool!
[21:19:09] !log copied interwiki.cdb from prod to wikitech, used $wgInterwikiCache to point to .cdb, mw does not use mysql table anymore. long iw links work again, "shortcuts" are different though and still an issue
[21:19:13] Logged the message, Master
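Condensed, the fix just logged amounts to this; the source path is the one Reedy quoted earlier, and the ownership step mirrors mutante's note that the file is owned by www-data (exact copy command assumed):

```bash
# Steal the production interwiki cache, as suggested above, and make it
# readable by the web server
scp fenari:/home/wikipedia/common/php/cache/interwiki.cdb ./cache/
chown www-data:www-data cache/interwiki.cdb

# LocalSettings.php then points at the file and the interwiki table is ignored:
#   $wgInterwikiCache = "$IP/cache/interwiki.cdb";
```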
[21:19:31] PROBLEM - Host arsenic is DOWN: PING CRITICAL - Packet loss = 100%
[21:19:39] the answer to your question is "all ops", doing it now
[21:19:54] !log rebooted mw28
[21:19:56] Logged the message, Master
[21:20:27] hrm
[21:21:56] AaronSchulz: i cleaned up mw-log
[21:22:45] mutante: can you also close/update the bug?
[21:22:52] i did
[21:22:58] New patchset: Eloquence; "Add Wikivoyage logo (old logo until new one has been decided)." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31941
[21:23:05] AaronSchulz: do you see any issues?
[21:23:14] I see a few more bogus files, but it looks mostly clean
[21:23:19] j^: uploaded, upgraded on tmh1/2, killall ffmpeg2theora ran
[21:23:33] mutante: oh sorry, thanks
[21:23:44] binasher: 10.0.11.49
[21:23:47] j^: and it'd be nice if ffmpeg/libav actually stopped/crashed when reaching their memory limits instead of looping forever
[21:24:07] AaronSchulz: what about it?
[21:24:25] still spamming the logs
[21:25:07] ?
[21:25:22] RECOVERY - Host arsenic is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms
[21:25:30] you are looking at memcached-serious.log right?
[21:25:49] no, memcached.log
[21:26:27] memcached-serious.log only contains log data on the old client
[21:26:56] mw49 spamming it doesn't actually have anything to do with the pecl ext
[21:27:01] PROBLEM - Apache HTTP on mw28 is CRITICAL: Connection refused
[21:27:03] the pecl mc uses -serious too
[21:27:25] and yet there isn't a log line in it related to pecl
[21:27:40] grep -v 11000 memcached-serious.log
[21:27:57] yes, no errors
[21:28:33] I'm just pointing out mc stuff since we are looking at it, though yeah, it's the old boxes as I can see from the IP
[21:29:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.036 seconds
[21:29:33] mutante: 53896.log and #61440.log are also bogus
[21:29:43] PROBLEM - Varnish HTTP bits on arsenic is CRITICAL: Connection refused
[21:29:55] paravoid: yes not sure how to fix that but will do some reading on it
[21:30:13] arg
[21:30:23] PROBLEM - SSH on arsenic is CRITICAL: Connection refused
[21:30:23] RECOVERY - Apache HTTP on mw28 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.688 second response time
[21:30:53] mutante: notpeter: paravoid: we have a lot of apaches that are on old buggy lucid kernels and have >211 days uptime :(
[21:31:29] that means they are not buggy enough
[21:32:07] yeah, if they were buggy enough to cause a reboot, we could ignore it
[21:32:07] binasher: I should disable memcached.log when we finish in fact
[21:32:37] as used, it only seems practical for debugging or dev installs
[21:33:23] New patchset: Eloquence; "Remove BreadCrumbs, add GeoCrumbs. BreadCrumbs was erroneously imported. It is not to be deployed anywhere. GeoCrumbs is used by Wikivoyage for hierarchical navigation hints." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31947
[21:34:21] New patchset: Eloquence; "Remove BreadCrumbs, add GeoCrumbs." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31947
[21:34:27] mutante: is the script you wrote to determine which hosts in a hostgroup need kernel upgrades still around?
[21:34:52] notpeter: if you're upgrading arsenic, you're probably hitting problems with the bonded links
[21:35:08] apparently that broke in precise
[21:35:41] blerg
[21:35:41] ok
[21:35:42] mark: do you know what's up with the mc hosts in precise? did new cables and/or optics actually get ordered for the last few hosts?
[21:36:00] binasher: not sure
[21:36:00] mark: what did you do to fix it?
[21:36:00] PROBLEM - Host mw49 is DOWN: PING CRITICAL - Packet loss = 100%
[21:36:07] RobH: ping
[21:36:12] sup
[21:36:23] !log reboot mw49 for kernel upgrade
[21:36:23] Logged the message, Master
[21:36:24] notpeter: manually, adding "bond-master bond0" to all the slave stanzas in /etc/network/interfaces, see e.g. sq67
[21:36:31] still need to automate that in puppet
[21:36:32] the intel SFPs, i requested an updated quote today for just a couple
[21:36:37] ok, cool
[21:36:38] that's why sq69 is down, and sq70 isn't upgraded yet
[21:36:42] so we can order them and test them before i commit to another big possibly wrong order
[21:36:43] binasher: yeah not a single problem with the new ones, want to do more?
[21:36:47] notpeter: it also breaks the first puppet run
[21:36:58] how so?
[21:37:07] binasher: i assume mark was pinging me about your question
[21:37:07] =]
[21:37:11] because it now dies during the first puppet run when it converts to a LAG
[21:37:21] where it would just work without issues before
[21:37:30] ah, gotcha
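What mark describes is one extra line per slave NIC in /etc/network/interfaces. A minimal sketch; the interface names and the bonding mode are assumptions, since the log only gives the bond-master line itself:

```
# /etc/network/interfaces -- on precise, slave stanzas need an explicit master
auto eth0
iface eth0 inet manual
    bond-master bond0    # the line added by hand on sq67

auto eth1
iface eth1 inet manual
    bond-master bond0

auto bond0
iface bond0 inet static
    # host addressing omitted
    bond-slaves eth0 eth1
    bond-mode 802.3ad    # assumed; "converts to a LAG" above suggests LACP
```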
[21:37:07] binasher: i assume mark was pinging me about your question [21:37:07] =] [21:37:11] because it now dies during the first puppet run when it converts to a LAG [21:37:21] where it would just work without issues before [21:37:30] ah, gotcha [21:37:40] binasher: So by tomorrow I hope to have an order in, with overnight shpping, so hopefully by end of this week we can try them with unpatched percise [21:37:42] and see if it works [21:37:51] RobH: i thought we already figured out what combo works in pmtpa? [21:38:06] it works, but not without you guys doing shit to make it work [21:38:21] I thought the point was to get a solution that requires no extra work when we change kernels [21:38:24] ah, not needing to do that would be good [21:38:27] * AaronSchulz likes how half of the udp logs are in IS and the other in CS [21:38:38] note that our R620s in esams have no issues with unsupported SFPs [21:38:39] yep, thats the plan, if the test we get in this week works [21:38:40] did you hear that Reedy? [21:38:41] they use broadcoms, not intels [21:38:52] then I will just order the intel sfps for the ones we have that have the issue [21:38:53] RobH: that sounds great, let me know when the order comes in [21:39:00] RECOVERY - Host mw49 is UP: PING OK - Packet loss = 0%, RTA = 0.56 ms [21:39:05] you bet, cuz im poking you or notpeter to test ;] [21:39:07] so for future Rx20 hosts it should be ok [21:39:13] mark: thats good news [21:39:27] RobH: I can help with that [21:40:22] cool, CT had asked me about this last week and i have a followup call with him later [21:40:27] but its on my list and in progress [21:44:50] AaronSchulz: lets.. how many more should we add? [21:44:55] New patchset: Aaron Schulz; "Use memcached-multwrite for all wikis." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31948 [21:45:07] binasher: never mind that ;) [21:45:33] actually that's probably ok [21:46:32] AaronSchulz: yeah, i'm ok with that. first, let me do an apache-graceful-all to make live a maxrequestperchild change i pushed out earlier [21:47:04] binasher: also, fyi, we now also have the parts onsite to wire it the way it is in tampa if we need it done before the week is out. [21:47:08] for memcached. [21:48:22] chris had spares in tampa and brought them up here [21:48:45] AaronSchulz: What? [21:48:53] RobH: ok, cool.. we still have time to try to get the sfp compatibility issue solved tho [21:49:00] asher is doing a graceful restart of all apaches [21:49:03] Reedy: I was just hinting at some possible code cleanup [21:49:07] awesome, just wanted you to know all the options [21:49:14] !log asher gracefulled all apaches [21:49:16] Logged the message, Master [21:49:19] move all the udp log code to InitialiseSettings [21:49:43] PROBLEM - NTP on arsenic is CRITICAL: NTP CRITICAL: No response from NTP server [21:51:26] binasher: is it done? [21:51:41] Change merged: Asher; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31948 [21:51:43] guess so [21:51:45] AaronSchulz: yah, deploy away [21:52:05] should we take bets on a site outage? [21:52:50] well that would only happen with bad timeouts ;) [21:52:57] * AaronSchulz could bet a $3 beer nothing would explode [21:54:11] !log aaron synchronized wmf-config/InitialiseSettings.php 'Disabled spammy (with pecl memcached) memcached.log, the useful one is memcached-serious.' 
[21:54:45] AaronSchulz: i would be happy to buy you a $3 beer
[21:54:50] !log aaron synchronized wmf-config/mc.php 'Switched all wikis to memcached-multiwrite.'
[21:54:55] binasher: come to Europe :-]
[21:54:56] Logged the message, Master
[21:55:10] AaronSchulz: lol, do you see the "ITEM TOO BIG" errors?
[21:55:26] let me ssh back in to tail
[21:55:28] that must have always been a prob.. but wtf, are those keys really >1mb?
[21:55:34] 2012-11-05 21:54:49 mw34 frwikisource: Memcached error for key "commonswiki:file:bf829888c5580d1c337c817973ff39a7" on server "10.0.12.4:11211": ITEM TOO BIG
[21:55:47] New patchset: Catrope; "Add VisualEditor namespace creation to wmf-config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31949
[21:55:56] yeah I see, a moderate amount of spam
[21:56:13] it's a few keys.. can we write the contents of one to a file?
[21:56:17] somehow
[21:56:25] binasher: I bet it is big file metadata
[21:56:40] file repo caches that with the file DB info, and it tends to get huge in some cases
[21:56:40] file metadata > 1mb sounds crazy
[21:57:32] binasher: see LocalFile::isCacheable() ;)
[21:57:54] that check is just for the in-process cache though
[21:59:16] still, the only real errors are for the old ones
[21:59:46] AaronSchulz: CACHE_FIELD_MAX_LEN should be larger in this case
[22:00:12] but still.. damn
[22:00:13] binasher: it might avoid repeat fetch/sets
[22:00:14] that's a lot of metadata
[22:00:15] yeah
[22:00:34] though then the RepoGroup::MAX_CACHE_SIZE value would need to drop
[22:00:42] or it's OOM time
[22:00:43] ok, off to a meeting
[22:00:59] ok, this seems like a fine place to leave it for now
[22:01:00] binasher: looks like there is lots of time to schedule the old cache removal from multiwrite tomorrow
[22:01:12] yup
[22:01:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:01:57] and while port 11211 is monitored by nagios on the individual mc servers, there's a script that checks all of them that needs updating
[22:02:13] AaronSchulz: i'd like to test persistent connections too, now that maxreqsperchild is back up
[22:10:43] !log rebooting mw21,25,40 for kernel upgrades
[22:10:45] Logged the message, Master
[22:12:00] PROBLEM - Apache HTTP on mw21 is CRITICAL: Connection refused
[22:14:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds
[22:17:16] PROBLEM - Apache HTTP on mw40 is CRITICAL: Connection refused
[22:20:33] mark: did you have a problem with late_command not running on the esams bits caches when you reinstalled them?
[22:20:33] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.059 second response time
[22:21:47] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31947
[22:21:51] notpeter: how do you figure that?
[22:21:54] RECOVERY - Apache HTTP on mw21 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.051 second response time
[22:22:47] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31941
[22:23:55] !log rebooting srv191,225,230 for kernel upgrades
[22:24:02] Logged the message, Master
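The "ITEM TOO BIG" errors above are the client refusing values over memcached's default 1 MB item size. A minimal sketch of the kind of size guard being discussed; the wrapper and constant are illustrative, not MediaWiki's actual LocalFile / CACHE_FIELD_MAX_LEN code:

    <?php
    // Sketch: memcached rejects items over its size limit (1 MB by
    // default, unless memcached is started with a larger -I value).
    define( 'MC_MAX_ITEM_BYTES', 1024 * 1024 );

    function cacheSetGuarded( Memcached $mc, $key, $value, $ttl ) {
        // Approximate the stored size; the client serializes for us.
        if ( strlen( serialize( $value ) ) >= MC_MAX_ITEM_BYTES ) {
            return false; // too big: let callers recompute instead
        }
        return $mc->set( $key, $value, $ttl );
    }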
[22:24:18] binasher: are there more like those?
[22:24:21] sorry, just saw the backlog
[22:25:17] what kernel version were they running?
[22:25:22] paravoid: i just finished upgrading all that are up for > 191 days, but the bulk of the lucid app servers are at 191 days
[22:25:32] I fired up servermon and I have kernel versions for all of our machines
[22:26:03] paravoid: i think notpeter can upgrade them all to precise within that time window
[22:26:08] the blocker was memcached on the apaches
[22:26:18] ah
[22:26:18] which will no longer be used as of tomorrow
[22:26:19] yay
[22:26:24] yay²
[22:26:41] have you looked at that performance issue at all though?
[22:26:55] it might bite us hard if we upgrade the whole appserver fleet to precise
[22:27:12] hashar is asking Tim to have a look
[22:28:21] paravoid: I just tried installing arsenic twice, and both times it failed to run late_command, which has important stuff like an ssh key and a directive to install ssh-server
[22:28:32] we are in our weekly call, added that to our agenda.
[22:28:36] hashar: thanks!
[22:28:51] though it seems to be related to kernel / driver / raid setup or something
[22:31:21] PROBLEM - Apache HTTP on srv191 is CRITICAL: Connection refused
[22:31:21] PROBLEM - Apache HTTP on srv230 is CRITICAL: Connection refused
[22:31:40] PROBLEM - Apache HTTP on srv225 is CRITICAL: Connection refused
[22:33:09] PROBLEM - Frontend Squid HTTP on amssq61 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:35:56] paravoid: i don't think that issue has been seen with the app servers which have been upgraded to precise
[22:36:19] I haven't checked
[22:36:19] RECOVERY - Frontend Squid HTTP on amssq61 is OK: HTTP OK HTTP/1.0 200 OK - 795 bytes in 0.239 seconds
[22:36:23] when i looked at it, the wait and disk i/o was all to sqlite files
[22:36:34] which nothing else uses
[22:36:36] PROBLEM - Puppet freshness on erzurumi is CRITICAL: Puppet has not run in the last 10 hours
[22:37:04] I was talking about sqlite's fsync() behavior today too
[22:37:04] binasher: yeah, upgrade-helper
[22:37:19] but hashar told me that he's seen it on non-sqlite tests too
[22:39:24] where are we using sqlite outside of doing tests?
[22:39:27] RECOVERY - Apache HTTP on srv191 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.053 second response time
[22:39:54] Platonides, swift
[22:39:56] (and jenkins should place them on tmpfs or use a patched sqlite...)
[22:40:40] wtf does swift need sqlite?
[22:40:52] yeeaaah
[22:41:14] swift does use sqlite, but that's completely off-topic here
[22:41:39] Reedy: we have the issue with the dbdump tests
[22:41:40] 30 sec run time to 1 minute
[22:41:58] there might be several issues though
[22:42:10] hashar, are they run in tmpfs?
[22:42:16] nope, on disk
[22:42:24] tmpfs would be an idea
[22:42:26] move them to memory
[22:42:54] makes no sense to wait for disk when we don't really care if the box crashes
[22:43:02] it is not going to make the Dump tests faster though
[22:43:03] RECOVERY - Apache HTTP on srv225 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.052 second response time
[22:43:08] paravoid: have you seen performance issues with production apache hosts? i haven't
[22:43:24] we could probably also use an LD_PRELOADed fake fdatasync()
[22:43:38] depending on how it is called by jenkins
[22:44:41] Platonides: anyway I am more worried about having perf issues on the application servers :-D
[22:48:11] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
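For context on late_command: it is the debian-installer preseed hook that runs at the end of an install. A hypothetical snippet doing the two things notpeter mentions (install ssh-server, drop in a key); this is illustrative, not the actual WMF preseed:

    # Hypothetical d-i preseed line, not the real config:
    d-i preseed/late_command string \
        in-target apt-get install -y openssh-server ; \
        mkdir -p /target/root/.ssh ; \
        echo "ssh-rsa AAAA...example..." > /target/root/.ssh/authorized_keys ; \
        chmod 600 /target/root/.ssh/authorized_keys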
[22:48:56] looking at lucid vs. precise apaches in slow-parse.log
[22:49:21] PROBLEM - check_squid on payments1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name squid
[22:49:27] 95% time for just lucid apaches is 16.39s, for precise apaches 16.24s
[22:50:28] PROBLEM - check_squid on payments1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name squid
[22:51:14] https://www.mediawiki.org/wiki/Wikimedia%20engineering%2020%%20policy
[22:51:17] 400 Bad Request?
[22:51:32] WFM
[22:52:21] Krinkle: Try https://www.mediawiki.org/wiki/Wikimedia%20engineering%2020%25%20policy
[22:52:28] ?
[22:52:28] yeah
[22:52:45] (urlencode('%') == '%25')
[22:52:50] I know how to get to the page, but there's a weird encoding issue recently
[22:53:01] heh 20% is actually in the title, %20 is space
[22:53:06] RECOVERY - MySQL Slave Delay on db78 is OK: OK replication delay 0 seconds
[22:53:12] Yup
[22:53:12] I think it started since we started using nginx
[22:53:33] https://www.mediawiki.org/wiki/Wikimedia_engineering 20% policy
[22:53:42] put that in your browser url
[22:54:05] nginx complaining
[22:54:19] whee, 2 different errors
[22:54:48] https://www.mediawiki.org/wiki/Wikimedia_engineering_20%25_policy
[22:54:57] http://cl.ly/image/1L2Y2C3v1C0S
[22:54:59] yeah, 25 appearing in there
[22:55:21] PROBLEM - check_squid on payments1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name squid
[22:55:21] PROBLEM - check_squid on payments1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name squid
[22:55:21] PROBLEM - check_squid on payments4 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[22:55:21] PROBLEM - check_squid on payments1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name squid
[22:55:22] PROBLEM - check_squid on payments1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name squid
[22:55:26] PROBLEM - check_squid on payments2 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[22:55:26] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[22:55:26] PROBLEM - check_squid on payments1 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[23:00:27] PROBLEM - check_squid on payments1003 is CRITICAL: PROCS CRITICAL: 1 process with command name squid3
[23:00:27] PROBLEM - check_squid on payments1001 is CRITICAL: PROCS CRITICAL: 1 process with command name squid3
[23:00:27] PROBLEM - check_squid on payments1002 is CRITICAL: PROCS CRITICAL: 1 process with command name squid3
[23:00:27] PROBLEM - check_squid on payments2 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[23:00:27] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[23:00:28] PROBLEM - check_squid on payments4 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[23:00:28] PROBLEM - check_squid on payments1004 is CRITICAL: PROCS CRITICAL: 1 process with command name squid3
[23:00:29] PROBLEM - check_squid on payments1 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[23:01:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 4.819 seconds
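The encoding confusion above boils down to the literal "%" in the page title needing to be percent-encoded itself. A quick PHP illustration:

    <?php
    // The real page title contains a literal "%":
    $title = 'Wikimedia engineering 20% policy';

    // A space encodes to %20, while "%" itself must become %25;
    // a bare "%" in a URL is an invalid escape sequence.
    echo rawurlencode( '%' ), "\n";    // %25

    // MediaWiki-style: underscores for spaces, then encode the rest.
    echo rawurlencode( str_replace( ' ', '_', $title ) ), "\n";
    // Wikimedia_engineering_20%25_policy  (the URL that works above)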
[23:02:56] hashar: so I can see the problem on gallium by just running the tests from the command line?
[23:03:18] TimStarling: yup, by running the Dump test suite. There is a copy in my homedir
[23:03:31] /home/hashar/mwcore/mediawiki-78a5729
[23:03:57] then php tests/phpunit/phpunit.php --group Dump
[23:04:01] I wrote some random thoughts on https://wikitech.wikimedia.org/view/User:Hashar/bug41607
[23:04:03] RECOVERY - Apache HTTP on srv230 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.049 second response time
[23:05:27] PROBLEM - check_squid on payments1004 is CRITICAL: PROCS CRITICAL: 1 process with command name squid3
[23:05:27] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[23:05:27] PROBLEM - check_squid on payments4 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[23:05:27] PROBLEM - check_squid on payments1002 is CRITICAL: PROCS CRITICAL: 1 process with command name squid3
[23:05:27] PROBLEM - check_squid on payments1003 is CRITICAL: PROCS CRITICAL: 1 process with command name squid3
[23:05:27] PROBLEM - check_squid on payments1001 is CRITICAL: PROCS CRITICAL: 1 process with command name squid3
[23:05:27] PROBLEM - check_squid on payments2 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[23:05:27] PROBLEM - check_squid on payments1 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[23:07:11] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31886
[23:07:26] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31888
[23:10:13] ooh, debug packages, fancy
[23:10:28] PROBLEM - check_squid on payments1004 is CRITICAL: PROCS CRITICAL: 1 process with command name squid3
[23:10:28] PROBLEM - check_squid on payments1003 is CRITICAL: PROCS CRITICAL: 1 process with command name squid3
[23:10:28] PROBLEM - check_squid on payments1001 is CRITICAL: PROCS CRITICAL: 1 process with command name squid3
[23:10:28] PROBLEM - check_squid on payments1002 is CRITICAL: PROCS CRITICAL: 1 process with command name squid3
[23:10:28] PROBLEM - check_squid on payments4 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[23:10:28] PROBLEM - check_squid on payments3 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[23:10:28] PROBLEM - check_squid on payments2 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[23:10:28] PROBLEM - check_squid on payments1 is CRITICAL: PROCS CRITICAL: 1 process with command name squid
[23:11:03] mutante: still some cruft .log files
[23:12:20] TimStarling: and I just found your doc https://wikitech.wikimedia.org/view/GDB_with_PHP
[23:14:27] yeah, that's basically what I'm doing, getting currently running function names
[23:15:31] RECOVERY - check_squid on payments1 is OK: PROCS OK: 1 process with command name squid
[23:15:31] RECOVERY - check_squid on payments1003 is OK: PROCS OK: 1 process with command name squid3
[23:15:31] RECOVERY - check_squid on payments1001 is OK: PROCS OK: 1 process with command name squid3
[23:15:31] RECOVERY - check_squid on payments1004 is OK: PROCS OK: 1 process with command name squid3
[23:15:31] RECOVERY - check_squid on payments1002 is OK: PROCS OK: 1 process with command name squid3
[23:15:32] RECOVERY - check_squid on payments4 is OK: PROCS OK: 1 process with command name squid
[23:15:32] RECOVERY - check_squid on payments3 is OK: PROCS OK: 1 process with command name squid
[23:15:33] RECOVERY - check_squid on payments2 is OK: PROCS OK: 1 process with command name squid
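A rough sketch of the GDB-with-PHP technique referenced above, assuming the php5 debug symbols (the "debug packages") are installed; the paths and process match are illustrative, and zbacktrace is the macro from the .gdbinit shipped in the PHP source tree:

    # Attach to the stuck test runner and dump the PHP-level stack:
    gdb -p "$(pgrep -f 'phpunit.php --group Dump' | head -n1)"
    (gdb) source /path/to/php-src/.gdbinit    # illustrative path
    (gdb) zbacktrace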
[23:22:06] New patchset: Asher; "remove -n option from php" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31968
[23:23:48] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31968
[23:27:02] New patchset: Reedy; "Few bits of wikivoyage config" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31971
[23:27:12] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31971
[23:27:45] !log reedy synchronized wmf-config/InitialiseSettings.php
[23:27:56] Logged the message, Master
[23:29:45] New patchset: Reedy; "Add enwikivoyage to dblists" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31973
[23:29:51] it's mostly I/O by the looks of it
[23:30:00] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31973
[23:30:23] there's about 4 seconds of system time due to the "declare(ticks=1);" in /usr/share/php/PHP/Invoker.php
[23:30:33] i.e. when I comment it out, that system time goes away
[23:30:58] that's the origin of the rt_sigprocmask() calls that you have probably seen in strace
[23:31:59] New patchset: Reedy; "Disable including of replacetext" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31974
[23:31:59] then there's around 20s of user time, and an additional 30-50 seconds of wall clock time that seems to be mostly fsync() calls while constructing SQLite databases
[23:32:20] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31974
[23:32:22] ;)
[23:32:31] so that is probably the low-hanging fruit
[23:34:10] !log reedy synchronized wmf-config/
[23:34:16] Logged the message, Master
[23:34:54] open("/home/hashar/mwcore/mediawiki-78a5729/my_wiki.sqlite-journal", O_RDWR|O_CREAT, 0664) = 9
[23:35:00] hmm
[23:35:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:35:04] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files:
[23:35:09] Logged the message, Master
[23:35:47] when it's called from the web it's at /var/lib/jenkins/jobs/MediaWiki-Tests-Parser/workspace/data/my_wiki.sqlite-journal
[23:36:19] or MediaWiki-Tests-Dumps I guess
[23:36:50] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: enwikivoyage to 1.21wmf3
[23:36:55] Logged the message, Master
[23:37:25] TimStarling: the sqlite db is deleted before each job IIRC
[23:38:20] TimStarling: I believe the declare( ticks = 1 ) is to find out when the test has reached some timeout
[23:38:41] I'm trying to run du on /var/lib/jenkins/jobs, it's been running for 2 minutes now
[23:39:03] maybe the new sqlite version has some perf regression regarding the way they write the journal
[23:39:09] moving it to a tmpfs would be an obvious solution, I'll let you know whether it is feasible when the du finishes
[23:39:23] oh, du is huge
[23:39:34] we have logs of all builds since Jenkins was set up
[23:39:39] there's always eatmydata
[23:40:03] (apt-cache show eatmydata)
[23:41:26] but /var/lib/jenkins/jobs/MediaWiki-Tests-Dumps/workspace is only 312KB
[23:41:49] the filesystem is ext3
[23:42:04] can you configure the workspace to be somewhere else?
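A hedged sketch of what the declare(ticks=1) in PHP/Invoker.php is there for, per the comments above: ticks make the engine check for pending signals after nearly every statement, so a SIGALRM handler can interrupt a runaway test promptly, and that per-statement bookkeeping is where the rt_sigprocmask() system time goes. This shows the general technique, not Invoker's exact code; it needs the pcntl extension on the CLI:

    <?php
    declare( ticks = 1 ); // check for signals after (almost) every statement

    pcntl_signal( SIGALRM, function () {
        throw new RuntimeException( 'test timed out' );
    } );
    pcntl_alarm( 2 ); // deliver SIGALRM in 2 seconds

    $n = 0;
    try {
        while ( true ) {
            $n++; // stand-in for a slow test body
        }
    } catch ( RuntimeException $e ) {
        echo $e->getMessage(), "\n"; // "test timed out"
    }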
[23:42:09] ext3 is known to sync the whole journal when you fsync an fd
[23:42:34] well, gallium has just one huge FS
[23:43:04] !log reedy synchronized php-1.21wmf3/extensions/UserMerge
[23:43:07] Logged the message, Master
[23:43:20] /dev/md0 on / type ext3 (rw,errors=remount-ro) 452G
[23:43:29] you could potentially also use "PRAGMA journal_mode = OFF;"
[23:43:55] that's an SQLite query that I guess would have to be run on each file you create
[23:44:05] yeah
[23:44:30] iirc sqlite has multiple modes of syncing
[23:44:36] depending on your constraints
[23:44:51] there was a bug a while back, when firefox switched to sqlite
[23:44:55] so maybe the journal_mode has been enabled by default in Precise
[23:45:06] firefox used the aggressive sync option, because it didn't want to lose data
[23:45:13] ext3 synced the whole journal because that's what it does
[23:45:24] well, there is a pragma called fullfsync
[23:45:38] and firefox wrote to sqlite every time you clicked a link or typed something into the "awesomebar"
[23:45:46] but the docs say it is off by default and there doesn't seem to be a configure entry to turn it on
[23:45:56] so every time you had firefox open the whole system was dead slow
[23:45:58] configure option I mean
[23:46:55] eatmydata is a nice shotgun approach to ignore all that
[23:48:31] New patchset: Reedy; "Fix casing of addWiki" [operations/mediawiki-multiversion] (master) - https://gerrit.wikimedia.org/r/31977
[23:48:40] "Only Mac OS X supports F_FULLFSYNC"
[23:48:43] Change merged: Reedy; [operations/mediawiki-multiversion] (master) - https://gerrit.wikimedia.org/r/31977
[23:48:47] right, it's not that then
[23:49:43] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.016 seconds
[23:50:24] anyway, sleep now
[23:50:30] New patchset: Reedy; "Remove old maintenance script locations" [operations/mediawiki-multiversion] (master) - https://gerrit.wikimedia.org/r/31978
[23:50:31] bye
[23:50:48] bye
[23:52:41] TimStarling: also I have seen some latencies when writing CDB files
[23:52:56] !log reedy synchronized multiversion
[23:53:02] Logged the message, Master
[23:53:28] TimStarling: anyway, that is a bit too late for me right now. If you find anything, can you share it on the bug report https://bugzilla.wikimedia.org/show_bug.cgi?id=41607 ? ;)
[23:53:42] ok
[23:53:44] New patchset: Asher; "remove one more "php -n"" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31979
[23:54:01] TimStarling: and as usual: thanks for your investigations 8-)
[23:54:26] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/31979
[23:54:27] New patchset: Reedy; "enwikivoyage to 1.21wmf3" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31980
[23:54:41] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31980
[23:55:44] New patchset: Kaldari; "Turning Echo on for test2.wiki and mediawiki.org" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31981
[23:56:48] Change merged: Kaldari; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/31981
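Following the journal_mode discussion above: if one did relax SQLite durability for these throwaway test databases, it would look something like this sketch (assuming PDO's sqlite driver; only sane for data you can afford to lose, like a per-build test wiki):

    <?php
    $db = new PDO( 'sqlite:my_wiki.sqlite' );
    // Skip the rollback journal entirely...
    $db->exec( 'PRAGMA journal_mode = OFF' );
    // ...and stop fsync()ing; on ext3 each fsync() can flush the
    // whole filesystem journal, not just this file's data.
    $db->exec( 'PRAGMA synchronous = OFF' );

The eatmydata route mentioned above gets a similar effect without touching any code: running "eatmydata php tests/phpunit/phpunit.php --group Dump" LD_PRELOADs no-op fsync()/fdatasync() into the process and its children.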