[00:01:40] vvv: bugzilla? ;)
[00:01:59] Track it against 38865
[00:02:02] Reedy: well, I cannot be sure the person who deploys code reads bugzilla
[00:02:06] Ah, there's a bug
[00:02:33] !log started filearchive.fa_sha1 migration, starting with s1
[00:02:44] Logged the message, Master
[00:03:11] AaronSchulz: it's nearly long running maintenance script time again! :p
[00:04:34] mutante: thanks for the heads up
[01:41:20] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 247 seconds
[01:49:17] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds
[02:00:41] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[02:00:41] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[02:00:42] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[02:20:42] !log LocalisationUpdate completed (1.21wmf1) at Wed Oct 10 02:20:42 UTC 2012
[02:20:57] Logged the message, Master
[02:34:44] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours
[02:41:09] !log LocalisationUpdate completed (1.20wmf12) at Wed Oct 10 02:41:09 UTC 2012
[02:41:20] Logged the message, Master
[02:52:44] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[03:00:05] RECOVERY - Puppet freshness on spence is OK: puppet ran at Wed Oct 10 02:59:36 UTC 2012
[03:01:42] !log completed fa_sha1 migration
[03:01:53] Logged the message, Master
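The fa_sha1 migration !logged above (started at 00:02, completed at 03:01) backfills the then-new filearchive.fa_sha1 column tracked as bug 38865. Purely as a hedged illustration of what such a backfill amounts to — the actual run used a MediaWiki maintenance script, and the wiki name, batch size and id below are made up:

    # Find archived files still missing a SHA-1, in small batches:
    mysql enwiki -e "SELECT fa_id FROM filearchive WHERE fa_sha1 = '' ORDER BY fa_id LIMIT 200;"
    # Recompute the hash of each archived file from file storage, then write it back:
    mysql enwiki -e "UPDATE filearchive SET fa_sha1 = '<base36 sha1>' WHERE fa_id = 12345;"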
[04:01:48] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[04:50:25] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours
[05:23:34] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out
[05:28:13] RECOVERY - Lucene on search1016 is OK: TCP OK - 9.022 second response time on port 8123
[05:30:10] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[05:34:40] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 9.019 second response time on port 8123
[05:45:49] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out
[05:47:19] RECOVERY - Lucene on search1016 is OK: TCP OK - 9.023 second response time on port 8123
[05:58:52] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out
[06:09:40] RECOVERY - Lucene on search1016 is OK: TCP OK - 0.027 second response time on port 8123
[06:14:01] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[06:19:25] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out
[06:22:25] RECOVERY - Lucene on search1016 is OK: TCP OK - 3.023 second response time on port 8123
[06:24:04] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[06:27:04] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123
[06:28:06] !log restarted lucene search on search1016
[06:28:20] Logged the message, Master
[07:37:51] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[07:37:51] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[07:54:48] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours
[10:52:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:54:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.166 seconds
[11:29:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:40:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.616 seconds
[11:49:44] New patchset: Matthias Mullie; "Deprecated: Use of wfGetIP was deprecated in MediaWiki 1.19." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27404
[12:01:23] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[12:01:23] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[12:01:23] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[12:02:03] snapmirroring of originals will take about 5 days to complete
[12:15:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:20:44] !log Increased SnapMirror TCP window size on nas1-a and nas1001-a for the originals snapmirror relationship, and restarted the initial transfer at the last checkpoint
[12:20:56] Logged the message, Master
[12:29:22] now it's twice as fast
[12:30:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.049 seconds
[12:35:17] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours
[12:40:13] !log Throttled SnapMirror originals transfer to 75 MB/s
[12:40:23] Logged the message, Master
[12:53:17] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[12:55:07] I haven't played with snapmirror, is it easy to set up?
[12:55:31] yes
[12:55:39] hehe I optimized its throughput
[12:55:43] and was then saturating one gige link
[12:55:48] so had to throttle it again
[12:55:53] at least now we can steer it
[12:55:55] haha
[12:56:16] it's single source to single destination, so of course it won't balance across the two NICs with 802.3ad
[12:56:49] I think we can wait a few more days... :)
[12:56:56] I do wonder what to do with thumbs though.
[12:56:57] yeah sure
[12:57:09] I just want things to be able to go as fast as I want them.
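For readers who, like the questioner above, have not touched SnapMirror: the !log entries here concern the async mirror of the originals volume from nas1-a to nas1001-a. The sketch below is only an approximation of the 7-mode knobs involved — the relationship/volume names are guesses and the exact syntax should be checked against the ONTAP release actually running on those filers:

    # /etc/snapmirror.conf on the destination filer (hypothetical names/paths):
    #   nas1-a:originals  nas1001-a:originals  kbs=76800  - - - -
    # kbs=76800 caps the transfer at 76800 KB/s = 75 MB/s = 600 Mbit/s, roughly
    # 60% of a single GigE link. A SnapMirror transfer is one TCP flow, and
    # 802.3ad hashes per flow, so it can only ever use one of the two bonded
    # NICs -- hence throttling rather than expecting more bandwidth.
    # An in-flight transfer can also be re-throttled from the filer console:
    nas1001-a> snapmirror throttle 76800 nas1001-a:originals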
[13:03:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:16:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.538 seconds
[13:50:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:00] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[14:06:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds
[14:10:51] meh
[14:10:54] memory errors on ms6
[14:25:32] hmm
[14:25:44] big increase in swift hits per second
[14:25:49] http://ganglia.wikimedia.org/latest/graph_all_periods.php?hreg[]=^ms-fe[1-9].pmtpa&mreg[]=swift_[A-Z]%2B_hits%24&z=large&gtype=stack&title=Swift+queries+per+second&aggregate=1&r=hour
[14:25:59] New patchset: Reedy; "Some initial configuration for fdcwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27415
[14:26:23] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27415
[14:31:29] odd
[14:32:26] no visible change on the upload squids in pmtpa
[14:37:15] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files:
[14:37:20] mw7: rsync: link_stat "/wikiversions.cdb" (in common) failed: Stale NFS file handle (116)
[14:37:20] mw7: rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1536) [generator=3.0.9]
[14:37:27] Logged the message, Master
[14:39:23] !log reedy synchronized wmf-config/
[14:39:34] Logged the message, Master
[14:39:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:40:13] swift's spike dropped again
[14:40:13] so what made it drop again?
[14:40:15] odd
[14:40:18] heh
[14:43:01] who knows
[14:46:22] Could someone with root run this for me on fenari please? rm -rf /home/wikipedia/common/docroot/fdc.old
[14:48:00] !log reedy synchronized s3.dblist
[14:48:11] Logged the message, Master
[14:48:22] Reedy: done
[14:48:27] thanks
[14:51:00] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours
[14:52:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.583 seconds
[14:52:40] paravoid: do you have a direct link to the graphite url for nfs latency?
[14:56:33] https://gdash.wikimedia.org/dashboards/filebackend/ I think
[14:56:36] * paravoid looks closer
[14:58:11] hmm I expected a spike when i was saturating the GigE
[14:58:18] https://gdash.wikimedia.org/dashboards/datastores/ that's interesting
[14:58:28] it's about when we had the swift spike, isn't it?
[14:58:43] yes
[14:58:54] external store gets too
[15:04:44] Could someone also please run sync-apache and then apache-graceful-all for me? It looks like Daniel didn't after making some further config updates..
[15:22:12] Hmm, never mind!
[15:24:43] !log reedy ran sync-common-all
[15:24:54] Logged the message, Master
[15:24:59] PROBLEM - MySQL Replication Heartbeat on db1042 is CRITICAL: CRIT replication delay 221 seconds
[15:25:08] PROBLEM - MySQL Replication Heartbeat on db38 is CRITICAL: CRIT replication delay 230 seconds
[15:25:08] PROBLEM - MySQL Slave Delay on db60 is CRITICAL: CRIT replication delay 229 seconds
[15:25:17] PROBLEM - MySQL Replication Heartbeat on db1049 is CRITICAL: CRIT replication delay 240 seconds
[15:25:26] PROBLEM - MySQL Replication Heartbeat on db1017 is CRITICAL: CRIT replication delay 248 seconds
[15:25:26] PROBLEM - MySQL Replication Heartbeat on db59 is CRITICAL: CRIT replication delay 249 seconds
[15:25:35] PROBLEM - MySQL Slave Delay on db38 is CRITICAL: CRIT replication delay 253 seconds
[15:25:35] PROBLEM - MySQL Slave Delay on db59 is CRITICAL: CRIT replication delay 253 seconds
[15:25:35] PROBLEM - MySQL Replication Heartbeat on db1001 is CRITICAL: CRIT replication delay 258 seconds
[15:25:35] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 258 seconds
[15:25:35] PROBLEM - Apache HTTP on mw21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:25:36] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:25:36] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:25:37] PROBLEM - Apache HTTP on srv268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:25:37] PROBLEM - MySQL Replication Heartbeat on db63 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:25:44] PROBLEM - MySQL Replication Heartbeat on db1043 is CRITICAL: CRIT replication delay 266 seconds
[15:25:44] PROBLEM - MySQL Replication Heartbeat on db1050 is CRITICAL: CRIT replication delay 266 seconds
[15:25:44] PROBLEM - MySQL Slave Delay on db36 is CRITICAL: CRIT replication delay 262 seconds
[15:25:44] PROBLEM - MySQL Slave Delay on db1017 is CRITICAL: CRIT replication delay 263 seconds
[15:25:44] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:25:45] PROBLEM - Apache HTTP on mw58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:25:45] PROBLEM - Apache HTTP on mw29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:25:46] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:25:46] PROBLEM - Apache HTTP on mw44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:25:47] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:25:53] PROBLEM - Apache HTTP on mw26 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:25:53] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:26:20] PROBLEM - MySQL Replication Heartbeat on db60 is CRITICAL: CRIT replication delay 302 seconds
[15:26:20] PROBLEM - MySQL Replication Heartbeat on db36 is CRITICAL: CRIT replication delay 302 seconds
[15:26:21] PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:26:21] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:26:21] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:26:29] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:26:29] PROBLEM - Apache HTTP on mw33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:26:29] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:26:29] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:26:29] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:26:56] RECOVERY - Apache HTTP on srv268 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.976 second response time
[15:26:57] whatcha doing reedy
[15:27:06] PROBLEM - Apache HTTP on mw51 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:06] PROBLEM - Apache HTTP on mw45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:13] Currently? Nothing
[15:27:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:14] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:14] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:18] no, one minute ago :)
[15:27:23] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:23] PROBLEM - Apache HTTP on mw24 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:23] PROBLEM - Apache HTTP on mw25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:25] I had just run sync-common-all
[15:27:32] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:32] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:32] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:41] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:41] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:50] that must have upset something
[15:27:59] PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:59] PROBLEM - Apache HTTP on mw59 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:59] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:28:00] seems to have come back again...
[15:28:08] PROBLEM - Apache HTTP on mw47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:28:08] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:28:08] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:28:08] PROBLEM - Apache HTTP on mw19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:28:08] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:28:09] or not..
[15:28:13] you think so? :P
[15:28:36] so these were daniel's changes?
[15:28:57] can you rollback?
[15:29:04] we are down right now
[15:29:20] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:29:42] hello?
[15:30:02] They were already live last night
[15:30:05] I can't change apache config
[15:30:17] sync-common-all is just MW files
[15:30:29] so you didn't sync anything new?
[15:30:31] http://ganglia.wikimedia.org/latest/?c=Application%20servers%20pmtpa&m=load_one&r=hour&s=by%20name&hc=4&mc=2 looks odd
[15:30:32] RECOVERY - MySQL Replication Heartbeat on db63 is OK: OK replication delay seconds
[15:30:40] New patchset: Ottomata; "Using log2udp to relay udp2logs from oxygen over to Kraken on analytics1011." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27417
[15:30:41] Found it, s3's having issues.
[15:31:02] right
[15:31:08] PROBLEM - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:31:36] wth?
[15:31:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27417
[15:31:40] db63 in trouble
[15:31:47] reedy@fenari:/usr/local/apache/common-local$ sql enwiki
[15:31:47] show ERROR 1040 (08004): Too many connections
[15:31:57] [6363440.564747] Out of memory: Kill process 775 (mysqld) score 982 or sacrifice child
[15:31:58] [6363440.572464] Killed process 775 (mysqld) total-vm:109088864kB, anon-rss:96977944kB, file-rss:0kB
[15:32:29] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 47585 bytes in 0.403 seconds
[15:32:30] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27417
[15:32:40] db63 is s1 though
[15:32:47] RECOVERY - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 59487 bytes in 8.802 seconds
[15:32:54] The rest look fine on dbtree
[15:33:09] mysqld is running
[15:33:31] it has finished crash recovery
[15:33:58] quite a large cpu spike about 10 minutes ago
[15:33:59] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:34:29] but mysql has been running since yesterday
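The db63 diagnosis above — clients bouncing off "Too many connections" while dmesg shows the OOM killer taking out mysqld at roughly 97 GB of resident memory — is the classic pattern of a database host whose InnoDB buffer pool leaves no memory headroom; Asher's "reduce innodb buffer pool to 75% of total ram" change later in this log is the follow-up. A hedged sketch of the checks involved, using only stock tooling (host name taken from the discussion above):

    # Did the kernel kill mysqld? (matches the dmesg lines pasted above)
    dmesg | grep -iE 'out of memory|killed process.*mysqld'
    # Are clients actually hitting the connection ceiling?
    mysql -h db63 -e "SHOW GLOBAL VARIABLES LIKE 'max_connections'; SHOW GLOBAL STATUS LIKE 'Threads_connected';"
    # How big is the buffer pool relative to physical RAM?
    mysql -h db63 -e "SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';"
    free -g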
[15:35:17] New patchset: Ottomata; "Reverting Tata Zero log filter back to what it was." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27418
[15:36:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27418
[15:36:57] https://gdash.wikimedia.org/dashboards/datastores/
[15:36:59] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39738 bytes in 2.463 seconds
[15:37:03] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27418
[15:40:08] RECOVERY - MySQL Slave Delay on db1017 is OK: OK replication delay 20 seconds
[15:40:26] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:40:53] RECOVERY - MySQL Slave Delay on db60 is OK: OK replication delay 4 seconds
[15:41:02] RECOVERY - MySQL Replication Heartbeat on db59 is OK: OK replication delay 10 seconds
[15:41:02] RECOVERY - MySQL Replication Heartbeat on db1017 is OK: OK replication delay 0 seconds
[15:41:02] RECOVERY - MySQL Replication Heartbeat on db1049 is OK: OK replication delay 0 seconds
[15:41:02] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.106 second response time
[15:41:11] RECOVERY - MySQL Slave Delay on db38 is OK: OK replication delay 0 seconds
[15:41:20] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 0 seconds
[15:41:20] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time
[15:41:20] RECOVERY - MySQL Replication Heartbeat on db1050 is OK: OK replication delay 0 seconds
[15:41:20] RECOVERY - MySQL Slave Delay on db59 is OK: OK replication delay 0 seconds
[15:41:20] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.491 second response time
[15:41:29] RECOVERY - MySQL Replication Heartbeat on db1001 is OK: OK replication delay 0 seconds
[15:41:29] RECOVERY - MySQL Slave Delay on db36 is OK: OK replication delay 25 seconds
[15:41:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.275 seconds
[15:41:29] RECOVERY - Apache HTTP on mw58 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.738 second response time
[15:41:29] RECOVERY - Apache HTTP on mw44 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time
[15:41:30] RECOVERY - Apache HTTP on mw51 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.036 second response time
[15:41:30] RECOVERY - Apache HTTP on mw52 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.046 second response time
[15:41:31] RECOVERY - Apache HTTP on mw29 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time
[15:41:31] RECOVERY - Apache HTTP on mw45 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.517 second response time
[15:41:32] RECOVERY - Apache HTTP on mw55 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.474 second response time
[15:41:32] RECOVERY - Apache HTTP on mw41 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.440 second response time
[15:41:33] RECOVERY - Apache HTTP on mw26 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.727 second response time
[15:41:33] RECOVERY - Apache HTTP on mw25 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.833 second response time
[15:41:34] RECOVERY - Apache HTTP on mw24 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.143 second response time
[15:41:34] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.192 second response time
[15:41:35] RECOVERY - Apache HTTP on mw21 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.061 second response time
[15:41:35] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.333 second response time
[15:41:38] RECOVERY - Apache HTTP on mw20 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.425 second response time
[15:41:38] RECOVERY - Apache HTTP on mw49 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.901 second response time
[15:41:38] RECOVERY - MySQL Replication Heartbeat on db1043 is OK: OK replication delay 0 seconds
[15:41:47] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time
[15:41:47] RECOVERY - Apache HTTP on mw28 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time
[15:41:47] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.181 second response time
[15:41:56] RECOVERY - MySQL Replication Heartbeat on db60 is OK: OK replication delay 0 seconds
[15:41:56] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 47585 bytes in 0.156 seconds
[15:41:56] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.269 second response time
[15:41:57] RECOVERY - MySQL Replication Heartbeat on db36 is OK: OK replication delay 0 seconds
[15:42:05] RECOVERY - MySQL Replication Heartbeat on db1042 is OK: OK replication delay 0 seconds
[15:42:14] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.048 second response time
[15:42:14] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time
[15:42:14] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.055 second response time
[15:42:14] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time
[15:42:14] RECOVERY - Apache HTTP on mw38 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time
[15:42:15] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.043 second response time
[15:42:15] RECOVERY - Apache HTTP on mw59 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time
[15:42:16] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time
[15:42:16] RECOVERY - Apache HTTP on mw54 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time
[15:42:17] RECOVERY - Apache HTTP on mw18 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time
[15:42:17] RECOVERY - Apache HTTP on mw17 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time
[15:42:18] RECOVERY - Apache HTTP on mw19 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.047 second response time
[15:42:18] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.059 second response time
[15:42:19] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.381 second response time
[15:42:19] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.056 second response time
[15:42:20] RECOVERY - MySQL Replication Heartbeat on db38 is OK: OK replication delay 0 seconds
[15:42:20] RECOVERY - Apache HTTP on mw47 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.189 second response time
[16:15:32] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[16:15:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:29:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.296 seconds
[16:31:11] New patchset: Alex Monk; "(bug 40911) Raise account creation throttle for Malayalam Wikipedia workshop" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27422
[16:43:43] !log reedy synchronized wmf-config/InitialiseSettings.php 'Set wgServer for fdcwiki'
[16:43:54] Logged the message, Master
[16:44:57] notpeter: about?
[16:57:41] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100%
[16:57:53] !log reedy synchronized wmf-config/InitialiseSettings.php 'Add meta as import source for fdcwiki'
[16:58:05] Logged the message, Master
[17:00:13] paravoid: Have you setup any new swift containers yet?
[17:01:51] no
[17:01:53] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms
[17:02:00] Aaron told me at one point that MW does this automatically
[17:02:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:02:35] I seem to recall someone else saying this before, but I know for the last batch of new wiki creations, Ben had to add them manually
[17:02:41] AaronSchulz: ^
[17:02:54] There is http://wikitech.wikimedia.org/view/Swift/How_To#Create_a_container linked...
[17:03:44] I don't think anyone has to do anything manually
[17:05:34] I just uploaded an image to the wiki and thumbnails aren't working
[17:05:36] uploads do..
[17:05:53] https://fdc.wikimedia.org/w/thumb.php?f=FDC.png&width=800
[17:05:56] Hmm, that 404s
[17:08:10] It's not loading thumb.php
[17:08:15] Reedy: sup?
[17:08:30] notpeter: can you setup search indexes for fdc.wikimedia.org please?
[17:08:39] sure
[17:09:19] AaronSchulz: it symlinks back to the same version...
[17:10:38] Reedy: do we have any kind of checklist for new wikis?
[17:10:51] Yup
[17:10:52] http://wikitech.wikimedia.org/view/Add_a_wiki
[17:11:29] Hmm, AaronSchulz api.php isn't found either...
[17:11:46] Reedy: so, just to be sure, wiki name will be fdcwiki, yes?
[17:11:58] database name? yeah
[17:12:03] cool
[17:12:59] !log reedy synchronized docroot/fdc/
[17:13:11] Logged the message, Master
[17:13:25] Well, that fixed the api link...
[17:14:45] If someone could also delete /usr/local/apache/common/docroot/fdc/w on srv233, that'd be helpful please
[17:14:48] owned by root:root
[17:15:12] !log created indices for fdcwiki on searchidx2 and searchidx1001
[17:15:23] Logged the message, notpeter
[17:16:06] AFK for dinner
[17:16:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.755 seconds
[17:18:29] notpeter: the new precise mw1-? job runners are all happy now, right?
[17:18:43] binasher: should be
[17:18:49] they are running jobs
[17:19:08] although, if they actually hate running jobs, then they're very unhappy.
[17:19:14] I'll commune with them and let you know.
[17:19:22] notpeter: when done with the fdc stuff, can you kill off all the old ones?
[17:19:39] sure. why?
[17:19:52] also, done with that. it's all of 1 command :)
[17:20:32] still too many of them
[17:20:36] ok
[17:21:24] same thing as a couple weeks ago
[17:22:55] is less more?
[17:23:11] or is more more?
[17:23:14] :)
[17:23:50] to be fair, the jobqueue on zhwiki was up to like 3 mil, then I added a bunch more threads and it went down over two days
[17:23:57] I can't prove causality, though
[17:25:34] binasher: ^
[17:32:07] New patchset: Pyoungmeister; "removing jobrunner from all apaches but new, precise, jobrunner only boxes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27430
[17:33:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27430
[17:34:14] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27430
[17:35:20] PROBLEM - check_minfraud_secondary on payments1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:35:20] PROBLEM - check_minfraud_secondary on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:35:20] PROBLEM - check_minfraud_secondary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:35:20] PROBLEM - check_minfraud_secondary on payments1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:35:20] PROBLEM - check_minfraud_secondary on payments1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:38:29] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[17:38:29] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[17:40:17] RECOVERY - check_minfraud_secondary on payments4 is OK: HTTP OK: HTTP/1.1 302 Found - 120 bytes in 0.424 second response time
[17:40:17] RECOVERY - check_minfraud_secondary on payments1001 is OK: HTTP OK: HTTP/1.1 302 Found - 120 bytes in 0.299 second response time
[17:40:17] RECOVERY - check_minfraud_secondary on payments1004 is OK: HTTP OK: HTTP/1.1 302 Found - 120 bytes in 0.297 second response time
[17:40:17] RECOVERY - check_minfraud_secondary on payments1002 is OK: HTTP OK: HTTP/1.1 302 Found - 120 bytes in 0.297 second response time
[17:40:17] RECOVERY - check_minfraud_secondary on payments1003 is OK: HTTP OK: HTTP/1.1 302 Found - 120 bytes in 0.298 second response time
[17:40:20] binasher: ok, done.
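For the jobrunner thread above (and the zhwiki queue that comes up again below), queue depth can be checked per wiki from a deploy host. mwscript and showJobs.php are standard MediaWiki/WMF tooling, but treat the exact flags here as illustrative rather than a record of what was actually run:

    # Total pending jobs on zhwiki (the "up to like 3 mil" figure mentioned above):
    mwscript showJobs.php --wiki=zhwiki
    # The same count broken down by job type (e.g. the refreshLinks2 jobs discussed later):
    mwscript showJobs.php --wiki=zhwiki --group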
[17:40:46] they took our jobs!
[17:41:11] binasher: you're destroying job opportunities. I just want you to know that.
[17:45:16] did they outsource them?
[17:46:13] apergos: the wmf jobrunners are now in a more competitive environment.
[17:46:32] you cut their pay??!!
[17:46:49] no no. I fired them so that others will work for less.
[17:47:01] that's rough
[17:47:09] notpeter: maybe you could have more spin up and turn off as needed
[17:47:41] really, we need more job creators. and by that, i mean template changes.
[17:47:42] it would be a "flexible labor market" economy
[17:47:55] I have a bit of update on the source of the GET requests to ms7 with old-style math urls, but I'm a bit tired to dig any further tonight...
[17:48:26] not even a byte of update?
[17:48:31] seems that pages that haven't been edited too recently are still sitting there cached with the lang/project/math urls
[17:48:57] not all of them (maybe some people force a reload, I dunno)
[17:49:03] the caches expire after 30 days right?
[17:49:10] yes but that doesn't seem to matter
[17:49:19] what I mean is
[17:49:20] Reedy & ma rk bumped the epoch the other day
[17:49:24] for this reason among other things
[17:49:24] I see things that have been revalidated
[17:49:38] as 'didn't change since the last-modified date'
[17:49:45] and still kept
[17:50:48] I'm getting these from ams squids, I haven't tried other ones yet
[17:50:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:51:14] < Last-Modified: Tue, 17 Apr 2012 13:50:08 GMT
[17:51:24] but
[17:51:25] < Date: Sat, 22 Sep 2012 17:02:57 GMT
[17:51:27] here's the url:
[17:51:40] 'http://en.wikipedia.org/wiki/Epimorphism
[17:51:51] I do curl -v -H 'Accept-Encoding: gzip,deflate,sdch' to get it
[17:52:13] sample bad path: //upload.wikimedia.org/wikipedia/en/math/6/f/f/6ff8dcfec7743192babadf5178662a4e.png
[17:52:33] I can dig up other examples as needed, ...tomorrow :-)
[17:53:37] ah and it's amssq32 with the hit, for me
[17:53:44] for that specific page
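apergos' probe above can be repeated with a single curl, keeping only the headers that matter for the stale-math-URL question. This is a sketch of the same check, not a new finding; which squid answers and which headers it returns will vary:

    curl -sv -H 'Accept-Encoding: gzip,deflate,sdch' \
         'http://en.wikipedia.org/wiki/Epimorphism' -o /dev/null 2>&1 |
      grep -iE '^< (Date|Last-Modified|Age|X-Cache)'
    # A Last-Modified months older than Date on a cache hit (amssq32 in the
    # example above) means the object was revalidated as "unchanged" and kept,
    # which is why pages parsed before the epoch bump can still reference the
    # old /lang/project/math/ image paths despite the 30-day cache lifetime.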
[17:54:02] New patchset: Pyoungmeister; "re-adding self to nagios (not sure why I wasn't in there...)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27431
[17:54:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27431
[17:55:26] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours
[17:55:32] anyways looking at a few I could not of course guess why some were ok and some aren't
[17:57:39] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27431
[17:57:59] !log reedy synchronized php-1.21wmf1/cache/interwiki.cdb 'Update interwiki cache'
[17:58:11] binasher: jobqueues are increasing, fyi
[17:58:11] Logged the message, Master
[17:58:26] I could spin up the jobrunners in eqiad...
[17:58:51] !log reedy synchronized php-1.20wmf12/cache/interwiki.cdb 'Update interwiki cache'
[17:59:03] Logged the message, Master
[18:00:58] mutante: about?
[18:02:32] binasher: by turning off the lucid ones, it was roughly a 65% decrease in number of jobrunners (from about 600 to about 200)
[18:02:53] there are 16 hosts in eqiad marked for jobrunning.
[18:04:05] Can someone please delete /usr/local/apache/common/docroot/fdc/w on srv233
[18:04:42] Reedy: for you i will
[18:05:21] Reedy: done
[18:05:28] thanks! :)
[18:06:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds
[18:10:57] notpeter: how many are running on each of the dedicated hosts?
[18:11:10] looks like 6?
[18:11:22] should be 12
[18:12:50] * Reedy pokes AaronSchulz
[18:13:05] Any idea about these thumb.php four oh fours?
[18:13:42] binasher: ah, no. they *should be* 12. but are only 5
[18:13:53] let me turn them up, yes?
[18:16:09] New patchset: Pyoungmeister; "increasing number of threads on jobrunner boxes from 5 to 12" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27437
[18:17:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27437
[18:18:28] What the hell are zhwiki doing? :/
[18:18:45] template edits
[18:19:01] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27437
[18:19:08] notpeter: release the jobs
[18:19:10] Everyone knows how much servers love edits to widely used templates :D
[18:19:29] i wonder if the zhwiki edits are from a bot
[18:19:42] template revert war?
[18:20:44] job queue on zhwiki out of control or something?
[18:20:55] back over 400,000
[18:21:16] wake me up when they get over 1,000,000 ;-)
[18:21:28] Don't go into a deep sleep...
[18:21:53] Is 20min long enough to fall asleep in?
[18:22:20] http://zh.wikipedia.org/w/index.php?namespace=10&tagfilter=&limit=500&title=Special%3A%E6%9C%80%E8%BF%91%E6%9B%B4%E6%94%B9
[18:22:23] Not that many template edits..
[18:22:56] Chrome has been sat trying to translate that page for the best part of 5min heh
[18:23:21] has anyone tried to ask Liangent what's going on?
[18:24:21] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Rest of wikipedia projects to 1.21wmf1
[18:24:33] Logged the message, Master
[18:26:37] binasher: just to be clear: did your comment mean yes to spinning up eqiad jobrunners as well?
[18:26:49] Reedy: that zh template edit list doesn't seem to sync up with the templates the zhwiki refreshLinks2 jobs are for
[18:26:57] notpeter: no
[18:27:00] lol.
[18:27:03] ok, didn't think so...
[18:28:06] And the jobs are using redirects
[18:31:24] 1 Warning: PHP Startup: apc.shm_size now uses M/G suffixes, please update your ini files in Unknown on line 0
[18:32:05] 10.0.11.16
[18:32:11] mw16
[18:32:30] Reedy: just fixed, actually
[18:32:38] <^demon> php.ini isn't on noc.wm.o anymore :(
[18:32:50] dpkg was fucked, breaking puppet runs
[18:34:10] ah
[18:34:28] anywho, should be fixed now. lemme know if that error pops up again
[18:34:40] willdo
[18:36:18] it would be nice if Job::batchInsert logged what action invoked it
[18:39:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:39:49] wow. load on apaches is *so much more reasonable* now
[18:43:40] binasher: I would like to increase load in pybal on mw* apaches by 50% as they're way under utilized now. sound legit?
[18:43:51] this is what I'm going by: http://ganglia.wikimedia.org/latest/?r=2hr&cs=&ce=&m=load_one&s=by+name&c=Application+servers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[18:44:56] notpeter: what changed ?
[18:44:57] :-)
[18:45:10] setting up APC is useful, someone said?
[18:45:38] LeslieCarr: turned off jobrunners on apaches and increased number of jobrunner threads on dedicated jobrunner boxes
[18:45:47] sweet
[18:45:57] all that load went here: http://ganglia.wikimedia.org/latest/?c=Jobrunners%20pmtpa&m=load_one&r=hour&s=by%20name&hc=4&mc=2
[18:46:01] yeah the mw boxes totally look underused right now -- good job :)
[18:46:06] !log storage3 power cycling
[18:46:15] yay
[18:46:18] Logged the message, Master
[18:46:36] much more reasonable, and less messy tangle of roles. hurray!
[18:48:00] yoyooooos
[18:48:15] i'm looking into setting up ganglia hadoop metrics reporting
[18:48:34] i haven't used much ganglia before, so I got a q or 2
[18:48:44] from this page: http://wiki.apache.org/hadoop/GangliaMetrics
[18:48:51] Replace @GANGLIA@ with the hostname or IP of the ganglia endpoint. Depending on your Ganglia configuration, the host or IP you use might be:
[18:48:51] • The value in the udp_send_channel you use in your /etc/gmond.conf (look for the line which says mcast_join=).
[18:48:51] • The gmetad server
[18:48:51] • Localhost
[18:48:59] gmond is running on the analytics nodes
[18:49:02] should I just use localhost?
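On the @GANGLIA@ question just above: the file being edited is hadoop-metrics.properties, and whether localhost, the multicast group from gmond.conf, or an aggregator host is the right endpoint depends on how udp_send_channel/udp_recv_channel are configured on the analytics nodes — so the address below is deliberately a placeholder. Class and property names follow the GangliaMetrics page linked above; the config path is an assumption:

    # Sketch: append to hadoop-metrics.properties on each Hadoop daemon host.
    cat >> /etc/hadoop/conf/hadoop-metrics.properties <<'EOF'
    # GangliaContext31 speaks the ganglia >= 3.1 wire format; 8649 is gmond's default port.
    dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
    dfs.period=10
    dfs.servers=GANGLIA_ENDPOINT:8649
    # Repeat the same three lines for the other metric sections (mapred, jvm, rpc, ...).
    EOF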
[18:51:50] PROBLEM - SSH on storage3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:53:08] so APC was fucked again on the appservers?
[18:53:11] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:54:21] New patchset: Pyoungmeister; "setting extension distributor cron to run at hour 3, not every minute in hour 3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27440
[18:54:41] useful that
[18:54:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.023 seconds
[18:54:54] hm?
[18:55:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27440
[18:55:16] notpeter: yay a thousand times yay
[18:57:25] Reedy: now i am
[18:58:18] gotcha, Leslie already took care of it i see, thx
[19:07:27] mutante: last known problem for fdcwiki is accessing thumb.php is giving a 404 :(
[19:24:41] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100%
[19:27:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:28:05] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27440
[19:31:57] Anyone any suggestions on how to work out why https://fdc.wikimedia.org/w/thumb.php is 404'ing?
[19:37:09] New patchset: Demon; "Tweak extension distributor to use new /mnt/originals/ location" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27443
[19:41:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.789 seconds
[19:47:42] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27443
[19:53:02] $wgDebugLogGroups['404'] = "udp://$wmfUdp2logDest/four-oh-four";
[19:54:23] lol
[19:55:25] notpeter: where is the apache/squid rule that maps /w/ -> live1.5/ ?
[19:55:30] * AaronSchulz hates how he can't find stuff
[19:55:42] AaronSchulz: apache-config?
[19:55:58] I guess it's an apache alias
[19:56:26] Hm.. no we use alias for / /wiki index.php
[19:58:28] yeah, I was looking there
[19:58:41] that's odd
[19:58:49] not in puppet files either
[19:59:50] AaronSchulz: ah, it is
[20:00:04] well, not in puppet
[20:00:09] but in mediawiki-config
[20:00:16] # Primary wiki redirector:
[20:00:17] Alias /wiki /usr/local/apache/common/docroot/mediawiki/w/index.php
[20:00:17] RewriteRule ^/$ /w/index.php
[20:00:18] RewriteRule ^/w/$ /w/index.php
[20:00:22] New patchset: Asher; "reduce innodb buffer pool to 75% of total ram" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27447
[20:00:29] which means /w is aliased from mediawiki-config docroot
[20:00:46] I guess we hardcode/repeat the symlink for each docroot?
[20:01:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27447
[20:02:22] AaronSchulz: yep, we copy docroot/skel-1.5/ for each domain, which contains it
[20:02:49] 'w@ -> /apache/common/live-1.5' that's not very maintainable :-/ I guess it makes sense though on a lower level
[20:03:26] It's very rare we have to do it
[20:03:52] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27447
[20:04:09] wouldn't it make sense to alias it from the apache conf files instead? that's also configurable per-domain but without having to repeat it for each domain
[20:04:18] .. in the file system
[20:04:26] I don't think it really makes that much of a difference
[20:05:00] it saves dozens of symlinks that all point to the same thing
[20:05:02] i'd like to do a quick sync-file for something in MobileFrontend - any reason i should hold off?
[20:05:16] And? It's not like they take up that much space
[20:05:26] awjr: feel free, I think
[20:05:38] Reedy: like Alias /wiki /usr/local/apache/common/live-1.5
[20:05:53] /w *
[20:06:12] the docroot folder is a whole 21M!!!
[20:06:17] Reedy: sure, the space is ok. but its easier to find / maintain.
[20:06:39] Submit a commit to fix it? And then find someone in ops who dares to deploy it ;)
[20:06:43] !log awjrichards synchronized php-1.21wmf1/extensions/MobileFrontend/javascripts/common/mf-application.js 'touchfile'
[20:06:44] also considering that the docroot is pointing to paths maintained outside the control of that repository
[20:06:55] Logged the message, Master
[20:14:34] !log flipped payments.wikimedia.org to the new eqiad cluster, live testing
[20:14:46] Logged the message, Master
[20:15:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:20:46] !log returning payments to pmtpa
[20:20:51] New patchset: Dzahn; "fix wrong number of columns in tables without language column, remove milestones from wikipedia table,.." [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/27448
[20:20:57] Logged the message, Master
[20:21:27] Change merged: Dzahn; [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/27448
[20:22:37] binasher: re what I was saying on friday; caching for donation stuff does seem to be working as intended... (so you were correct); but -- do you happen to have a moment to talk to me about how the Vary header works?
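mwalker's Vary question at the end of this exchange never gets an answer in-channel, so a short illustration: Vary lists the request headers the origin used when choosing a response, and Squid/Varnish therefore keep one cached copy per distinct combination of those header values. A hedged example of inspecting it — the URL is just any cacheable page, and the exact header set returned will differ by site and era:

    curl -sI 'http://en.wikipedia.org/wiki/Main_Page' | grep -iE '^(Vary|X-Cache)'
    #   Vary: Accept-Encoding, Cookie
    # -> the cache stores a separate object per (Accept-Encoding, Cookie)
    #    combination, which is how a session cookie bypasses the shared
    #    anonymous copy; the same mechanism governs the donation-page
    #    caching being discussed here.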
[20:28:27] New patchset: Cmcmahon; "Adding a comment for demo purposes, +2 or -2, doesn't matter" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27449
[20:29:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.275 seconds
[20:43:27] !log draining esams via authdns-scenario for upgrade
[20:43:38] Logged the message, Mistress of the network gear.
[20:59:01] LeslieCarr: are you now a vampire?
[20:59:19] a bitvampire
[21:00:06] I knew it!
[21:01:24] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer:
[21:02:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:02:55] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:06:12] PROBLEM - Frontend Squid HTTP on sq43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:07:33] RECOVERY - Frontend Squid HTTP on sq43 is OK: HTTP OK HTTP/1.0 200 OK - 604 bytes in 0.002 seconds
[21:09:03] PROBLEM - LVS HTTP IPv4 on upload.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:09:24] er
[21:10:24] RECOVERY - LVS HTTP IPv4 on upload.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 606 bytes in 3.437 seconds
[21:11:22] wth is going on
[21:11:36] PROBLEM - Frontend Squid HTTP on sq85 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:11:45] I *think* all traffic going to pmtpa is overloading those squids
[21:11:46] LeslieCarr's drainage?
[21:11:57] could be, i can roll back
[21:11:58] there's a fair amount of wait on the pmtpa squids
[21:12:07] can we try that?
[21:12:12] unless mark says otherwise
[21:12:46] ah
[21:12:48] let me check
[21:13:01] might just be totally cold caches
[21:13:03] load on upload squids has doubled
[21:13:07] RECOVERY - Frontend Squid HTTP on sq85 is OK: HTTP OK HTTP/1.0 200 OK - 606 bytes in 9.140 seconds
[21:13:33] of course, if you don't mind me upgrading the switch i'll do it asap and it will be another 7-10 minutes
[21:14:30] looks stable now...
[21:14:36] PROBLEM - Host db1008 is DOWN: PING CRITICAL - Packet loss = 100%
[21:15:39] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 214 seconds
[21:17:18] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay NULL seconds
[21:19:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds
[21:20:00] PROBLEM - LVS HTTP IPv4 on upload.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:20:18] PROBLEM - Frontend Squid HTTP on sq43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:20:27] PROBLEM - Frontend Squid HTTP on sq44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:20:49] !log undraining esams via authdns-scenario normal
[21:21:00] Logged the message, Mistress of the network gear.
[21:21:30] RECOVERY - LVS HTTP IPv4 on upload.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 606 bytes in 7.740 seconds
[21:21:48] RECOVERY - Frontend Squid HTTP on sq43 is OK: HTTP OK HTTP/1.0 200 OK - 606 bytes in 5.268 seconds
[21:21:57] RECOVERY - Frontend Squid HTTP on sq44 is OK: HTTP OK HTTP/1.0 200 OK - 606 bytes in 0.012 seconds
[21:29:36] is there an awareness that images seem to be taking ages to load on all our wikis? or... is this just limited to the 6th floor corner where I'm located?
[21:29:47] heh
[21:30:56] actually; scratch the above; whatever it was I was seeing appears to have sorted itself
[21:32:48] mwalker: yes
[21:32:53] mwalker: we had a short outage
[21:33:18] Ryan_Lane: yay lawnmowers :)
[21:33:29] heh. well, not lawnmowers this time :)
[21:34:00] weedwhacker?
[21:34:28] we moved traffic and had an overload
[21:34:33] then moved it back
[21:34:36] no, dns-wacker ;)
[21:52:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:02:27] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[22:02:27] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[22:02:27] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[22:05:21] !log apt-get dist-upgrade on marmontel (blog)
[22:05:35] Logged the message, Master
[22:08:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.038 seconds
[22:15:39] PROBLEM - Host db1013 is DOWN: PING CRITICAL - Packet loss = 100%
[22:16:42] RECOVERY - Host db1013 is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms
[22:26:35] New patchset: Jgreen; "adding db1013 as a potential replacement for db1008 (fundraising db master)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27476
[22:27:36] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27476
[22:28:10] TimStarling: Do you know who's in charge of Wikitech wiki these days? It's running 1.17, so I'm trying to get a bug resolved about upgrading it, but I'm not sure who to bother these days.
[22:28:26] nobody
[22:28:30] that's why it's running 1.17
[22:28:34] Heh.
[22:28:56] mutante tried to upgrade it earlier and failed
[22:28:58] It's hardly alone: https://meta.wikimedia.org/wiki/Services
[22:29:13] Wikimedia can't upgrade a simple MediaWiki installation? :-)
[22:29:28] He bugged me about it a few weeks ago, I forget how it broke
[22:30:09] https://bugzilla.wikimedia.org/show_bug.cgi?id=37317
[22:30:13] And now that I'm back in school and working 3 days a week, I don't really have time to upgrade a MW instance on a random-ass Linode instance
[22:30:15] I marked it "ops" today.
[22:30:22] I didn't know about mutante's work.
[22:30:30] Right.
[22:30:40] I didn't CC you on the bug for a reason. I assume you'll be rather busy.
[22:30:40] He did .... something. But I don't really know the details
[22:30:55] He'll undoubtedly see this in scrollback. :_0
[22:30:56] He asked for help but I mostly didn't want anything to do with it because I was (and am) busy
[22:30:58] Oop, :-)
[22:31:08] That's fair.
[22:31:36] 1.17 really is getting a bit old, though. If there are blockers, they should be at least noted somewhere.
[22:32:21] I was on it today and I'm like "oh right diff color changed".
[22:32:47] Hah
[22:32:54] That's not the only thing that changed since 1.17
[22:34:12] I tried to make an editnotice on that wiki today, but the MediaWiki namespace just glared at me.
[22:35:43] Brooke: we want to setup a new instance on precise and install a fresh wiki and import data. in between we were waiting for an openstack instance as opposed to linode
[22:36:02] Oooooh OK
[22:36:13] mutante: Cool. Can you comment on that bug, just so that doesn't get lost. :-)
[22:36:23] Or if there's an RT ticket, it could be cross-referenced.
[22:36:31] ok, will link them to each other
[22:36:37] Awesome, thank you.
[22:36:40] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours
[22:36:43] np
[22:39:34] Jeff_Green: Ok, onsite now
[22:39:39] any change since i left?
[22:40:15] seems db1008 mgmt is still offline, but it does have a login prompt on console
[22:40:24] a locked up console
[22:40:33] ok, im going to remove all power to see if it doesn't resolve the drac lockup
[22:40:43] !log db1008 being cold reset (removing power cords)
[22:40:54] Logged the message, RobH
[22:41:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:46:06] RobH: no changes
[22:46:08] !log reboot seems to have restored system (is posting) but drac still not responding, possible bad cable, will test shortly
[22:46:19] Logged the message, RobH
[22:48:31] RECOVERY - Host db1008 is UP: PING OK - Packet loss = 0%, RTA = 26.69 ms
[22:49:04] Jeff_Green: so its up, but mgmt is still down
[22:49:08] going to investigate it now
[22:49:41] ok. are you expecting to shut it down again?
[22:50:37] now I have a new theory
[22:51:10] exploding management switch takes down server :-P
[22:51:55] Jeff_Green: i may have to yes
[22:51:58] PROBLEM - mysqld processes on db1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[22:52:02] i replaced the mgmt cable, and it also doesn't work
[22:52:05] RobH: k. no problem
[22:52:08] which makes me think i need to ensure the drac is seated
[22:52:52] mysql isn't configured to start on boot, I'll leave it down until you're done
[22:54:05] heh, Jeff_Green you can bring it back online
[22:54:08] its bad ports on mgmt switch
[22:54:15] move original cable to different set of ports, it works
[22:54:18] so bad mgmt switch
[22:54:18] ah HA
[22:54:27] ok cool
[22:54:28] we may have to rethink the cheap mgmt switch thing
[22:54:36] which cheap switches do we buy?
[22:54:39] i'll drop a ticket to fix the rest by replacing the mgmt switch
[22:54:40] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[22:54:43] cheapo 48 port netgear
[22:54:56] each rack level mgmt switch is uplinked to the core mgmt switch per datacenter
[22:54:56] we used to buy some less cheapo HPs that were ok
[22:54:58] which is a juniper
[22:55:12] yeah
[22:55:19] is there only one switch for this rack?
[22:55:26] but we should look at getting proper redundant power supply and higher quality mgmt racklevel switches
[22:55:29] yep
[22:55:32] you know, i've *never* seen a port go dead on a netgear?
[22:55:37] i have.
[22:55:39] quite a few.
[22:55:39] we used to lose ports on our cisco gear all the time
[22:55:46] i believe it, it's just funny how that works
[22:55:51] netgear = never
[22:55:55] i have had good luck with foundry
[22:55:56] linksys = used to crash a lot
[22:56:04] foundry = power supplies fried all the time
[22:56:11] and our juniper stuff works, its just picky as all shit on what it will work with.
[22:56:13] force10 = line cards failed
[22:56:16] yeah
[22:56:26] well, i have fixed the issue!
[22:56:27] \o/
[22:56:36] well, fixed the symptom that was blocking work
[22:56:41] YAYYY! I hereby transfer several of the beers other people owe me to you!
[22:56:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds
[22:57:00] it actually feels like there may have been an 'event' on this rack
[22:57:03] power spike or something
[22:58:00] maybe, will look at graphs once they all aggregate on torrus and im not on site
[22:58:06] yeah
[22:58:12] lessee if it has any info that jumps out
[22:58:29] Jeff_Green: lemme know when mysql is back up and I will bring the pipeline back up
[22:58:42] thanks a million RobH
[22:58:46] RobH: last log was at 21:05:02
[22:58:53] i dont see anything really standing out on torrus for the power strip
[22:58:59] http://torrus.wikimedia.org/torrus/Facilities?nodeid=device//ps1-a2-eqiad
[22:59:15] heh, its late, cuz if 40 folks click that link, torrus will take a dive.
[22:59:16] a spike would be too short for a power strip to pick up though
[22:59:20] so hopefully they wont.
[22:59:50] they have logging
[22:59:53] but we of course dont have it on
[23:00:01] perhaps we should think about turning those on.
[23:00:15] oh, it stores it locally, looking now
[23:00:18] pgehres: mysql's starting but it will take a while to crash recover
[23:00:34] logs shows nothing
[23:00:38] RobH: it won't
[23:00:48] not sure how detailed it would be for volt/amp flux
[23:00:53] those things sample every X seconds or whatever
[23:01:13] oh well, switch needs replacing, i dropped a ticket to take care of tomorrow
[23:01:16] RECOVERY - mysqld processes on db1008 is OK: PROCS OK: 1 process with command name mysqld
[23:01:24] Jeff_Green: so ya good?
[23:01:28] yes--thanks robh
[23:02:10] !log msw-a2-eqiad has a bad set of ports, needs replacement per rt 3683
[23:02:17] !log db1008.mgmt returned to service
[23:02:21] Logged the message, RobH
[23:02:28] pgehres: well well well, it says it's done already
[23:02:31] Logged the message, RobH
[23:02:34] lemme just check on replication
[23:02:45] i gave db1008 the evil eye
[23:02:47] its gonna behave now.
[23:03:09] you could scrawl "poor impulse control" across the front too
[23:03:11] or "miscreant"
[23:03:25] then we'll photograph it for a banner campaign
[23:03:53] i explained that making me drive down here was breaking one of the laws of robotics
[23:04:01] so it knows if it does it again, i will old yeller it.
[23:04:21] im out!
[23:07:37] New patchset: Ryan Lane; "Initial commit of deployment module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27478
[23:08:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27478
[23:08:59] New patchset: Ryan Lane; "Add a role for the deployment module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27479
[23:10:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27479
[23:15:58] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 265 seconds
[23:20:59] New patchset: Dzahn; "planet - adding more translations per meta.wm" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27484
[23:21:58] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27484
[23:22:46] New patchset: Ryan Lane; "Add a role for the deployment module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27479
[23:23:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27479
[23:27:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:34:25] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds
[23:44:07] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.015 seconds
[23:47:52] PROBLEM - mysqld processes on db1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld