[00:01:40] vvv: bugzilla? ;)
[00:01:59] Track it against 38865
[00:02:02] Reedy: well, I cannot be sure the person who deploys code reads bugzilla
[00:02:06] Ah, there's a bug
[00:02:33] !log started filearchive.fa_sha1 migration, starting with s1
[00:02:44] Logged the message, Master
[00:03:11] AaronSchulz: it's nearly long running maintenance script time again! :p
[00:04:34] mutante: thanks for the heads up
[01:41:20] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 247 seconds
[01:49:17] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds
[02:00:41] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[02:00:41] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[02:00:42] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[02:20:42] !log LocalisationUpdate completed (1.21wmf1) at Wed Oct 10 02:20:42 UTC 2012
[02:20:57] Logged the message, Master
[02:34:44] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours
[02:41:09] !log LocalisationUpdate completed (1.20wmf12) at Wed Oct 10 02:41:09 UTC 2012
[02:41:20] Logged the message, Master
[02:52:44] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[03:00:05] RECOVERY - Puppet freshness on spence is OK: puppet ran at Wed Oct 10 02:59:36 UTC 2012
[03:01:42] !log completed fa_sha1 migration
[03:01:53] Logged the message, Master
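The fa_sha1 migration !logged above (started at 00:02, completed at 03:01) backfills the then-new filearchive.fa_sha1 column tracked as bug 38865. Purely as a hedged illustration of what such a backfill amounts to — the actual run used a MediaWiki maintenance script, and the wiki name, batch size and id below are made up:

    # Find archived files still missing a SHA-1, in small batches:
    mysql enwiki -e "SELECT fa_id FROM filearchive WHERE fa_sha1 = '' ORDER BY fa_id LIMIT 200;"
    # Recompute the hash of each archived file from file storage, then write it back:
    mysql enwiki -e "UPDATE filearchive SET fa_sha1 = '<base36 sha1>' WHERE fa_id = 12345;"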
[04:01:48] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[04:50:25] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours
[05:23:34] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out
[05:28:13] RECOVERY - Lucene on search1016 is OK: TCP OK - 9.022 second response time on port 8123
[05:30:10] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[05:34:40] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 9.019 second response time on port 8123
[05:45:49] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out
[05:47:19] RECOVERY - Lucene on search1016 is OK: TCP OK - 9.023 second response time on port 8123
[05:58:52] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out
[06:09:40] RECOVERY - Lucene on search1016 is OK: TCP OK - 0.027 second response time on port 8123
[06:14:01] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[06:19:25] PROBLEM - Lucene on search1016 is CRITICAL: Connection timed out
[06:22:25] RECOVERY - Lucene on search1016 is OK: TCP OK - 3.023 second response time on port 8123
[06:24:04] PROBLEM - LVS Lucene on search-pool4.svc.eqiad.wmnet is CRITICAL: Connection timed out
[06:27:04] RECOVERY - LVS Lucene on search-pool4.svc.eqiad.wmnet is OK: TCP OK - 0.027 second response time on port 8123
[06:28:06] !log restarted lucene search on search1016
[06:28:20] Logged the message, Master
[07:37:51] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[07:37:51] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[07:54:48] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours
[10:52:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:54:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.166 seconds
[11:29:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:40:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.616 seconds
[11:49:44] New patchset: Matthias Mullie; "Deprecated: Use of wfGetIP was deprecated in MediaWiki 1.19." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27404
[12:01:23] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[12:01:23] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[12:01:23] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[12:02:03] snapmirroring of originals will take about 5 days to complete
[12:15:02] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:20:44] !log Increased SnapMirror TCP window size on nas1-a and nas1001-a for the originals snapmirror relationship, and restarted the initial transfer at the last checkpoint
[12:20:56] Logged the message, Master
[12:29:22] now it's twice as fast
[12:30:47] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.049 seconds
[12:35:17] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours
[12:40:13] !log Throttled SnapMirror originals transfer to 75 MB/s
[12:40:23] Logged the message, Master
[12:53:17] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[12:55:07] I haven't played with snapmirror, is it easy to set up?
[12:55:31] yes
[12:55:39] hehe I optimized its throughput
[12:55:43] and was then saturating one gige link
[12:55:48] so had to throttle it again
[12:55:53] at least now we can steer it
[12:55:55] haha
[12:56:16] it's single source to single destination, so of course it won't balance across the two NICs with 802.3ad
[12:56:49] I think we can wait a few more days... :)
[12:56:56] I do wonder what to do with thumbs though.
[12:56:57] yeah sure
[12:57:09] I just want things to be able to go as fast as I want them.
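For readers who, like the questioner above, have not touched SnapMirror: the !log entries here concern the async mirror of the originals volume from nas1-a to nas1001-a. The sketch below is only an approximation of the 7-mode knobs involved — the relationship/volume names are guesses and the exact syntax should be checked against the ONTAP release actually running on those filers:

    # /etc/snapmirror.conf on the destination filer (hypothetical names/paths):
    #   nas1-a:originals  nas1001-a:originals  kbs=76800  - - - -
    # kbs=76800 caps the transfer at 76800 KB/s = 75 MB/s = 600 Mbit/s, roughly
    # 60% of a single GigE link. A SnapMirror transfer is one TCP flow, and
    # 802.3ad hashes per flow, so it can only ever use one of the two bonded
    # NICs -- hence throttling rather than expecting more bandwidth.
    # An in-flight transfer can also be re-throttled from the filer console:
    nas1001-a> snapmirror throttle 76800 nas1001-a:originals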
[13:03:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[13:16:23] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 3.538 seconds
[13:50:51] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:03:00] PROBLEM - Puppet freshness on zhen is CRITICAL: Puppet has not run in the last 10 hours
[14:06:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.026 seconds
[14:10:51] meh
[14:10:54] memory errors on ms6
[14:25:32] hmm
[14:25:44] big increase in swift hits per second
[14:25:49] http://ganglia.wikimedia.org/latest/graph_all_periods.php?hreg[]=^ms-fe[1-9].pmtpa&mreg[]=swift_[A-Z]%2B_hits%24&z=large&gtype=stack&title=Swift+queries+per+second&aggregate=1&r=hour
[14:25:59] New patchset: Reedy; "Some initial configuration for fdcwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27415
[14:26:23] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27415
[14:31:29] odd
[14:32:26] no visible change on the upload squids in pmtpa
[14:37:15] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files:
[14:37:20] mw7: rsync: link_stat "/wikiversions.cdb" (in common) failed: Stale NFS file handle (116)
[14:37:20] mw7: rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1536) [generator=3.0.9]
[14:37:27] Logged the message, Master
[14:39:23] !log reedy synchronized wmf-config/
[14:39:34] Logged the message, Master
[14:39:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:40:13] swift's spike dropped again
[14:40:13] so what made it drop again?
[14:40:15] odd
[14:40:18] heh
[14:43:01] who knows
[14:46:22] Could someone with root run this for me on fenari please? rm -rf /home/wikipedia/common/docroot/fdc.old
[14:48:00] !log reedy synchronized s3.dblist
[14:48:11] Logged the message, Master
[14:48:22] Reedy: done
[14:48:27] thanks
[14:51:00] PROBLEM - Puppet freshness on ms-be7 is CRITICAL: Puppet has not run in the last 10 hours
[14:52:39] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.583 seconds
[14:52:40] paravoid: do you have a direct link to the graphite url for nfs latency?
[14:56:33] https://gdash.wikimedia.org/dashboards/filebackend/ I think
[14:56:36] * paravoid looks closer
[14:58:11] hmm I expected a spike when i was saturating the GigE
[14:58:18] https://gdash.wikimedia.org/dashboards/datastores/ that's interesting
[14:58:28] it's about when we had the swift spike, isn't it?
[14:58:43] yes
[14:58:54] external store gets too
[15:04:44] Could someone also please run sync-apache and then apache-graceful-all for me? It looks like Daniel didn't after making some further config updates..
[15:22:12] Hmm, never mind!
[15:24:43] !log reedy ran sync-common-all
[15:24:54] Logged the message, Master
[15:24:59] PROBLEM - MySQL Replication Heartbeat on db1042 is CRITICAL: CRIT replication delay 221 seconds
[15:25:08] PROBLEM - MySQL Replication Heartbeat on db38 is CRITICAL: CRIT replication delay 230 seconds
[15:25:08] PROBLEM - MySQL Slave Delay on db60 is CRITICAL: CRIT replication delay 229 seconds
[15:25:17] PROBLEM - MySQL Replication Heartbeat on db1049 is CRITICAL: CRIT replication delay 240 seconds
[15:25:26] PROBLEM - MySQL Replication Heartbeat on db1017 is CRITICAL: CRIT replication delay 248 seconds
[15:25:26] PROBLEM - MySQL Replication Heartbeat on db59 is CRITICAL: CRIT replication delay 249 seconds
[15:25:35] PROBLEM - MySQL Slave Delay on db38 is CRITICAL: CRIT replication delay 253 seconds
[15:25:35] PROBLEM - MySQL Slave Delay on db59 is CRITICAL: CRIT replication delay 253 seconds
[15:25:35] PROBLEM - MySQL Replication Heartbeat on db1001 is CRITICAL: CRIT replication delay 258 seconds
[15:25:35] PROBLEM - MySQL Replication Heartbeat on db1033 is CRITICAL: CRIT replication delay 258 seconds
[15:25:35] PROBLEM - Apache HTTP on mw21 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:25:36] PROBLEM - Apache HTTP on mw20 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:25:36] PROBLEM - Apache HTTP on mw55 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:25:37] PROBLEM - Apache HTTP on srv268 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:25:37] PROBLEM - MySQL Replication Heartbeat on db63 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[15:25:44] PROBLEM - MySQL Replication Heartbeat on db1043 is CRITICAL: CRIT replication delay 266 seconds
[15:25:44] PROBLEM - MySQL Replication Heartbeat on db1050 is CRITICAL: CRIT replication delay 266 seconds
[15:25:44] PROBLEM - MySQL Slave Delay on db36 is CRITICAL: CRIT replication delay 262 seconds
[15:25:44] PROBLEM - MySQL Slave Delay on db1017 is CRITICAL: CRIT replication delay 263 seconds
[15:25:44] PROBLEM - Apache HTTP on mw49 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:25:45] PROBLEM - Apache HTTP on mw58 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:25:45] PROBLEM - Apache HTTP on mw29 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:25:46] PROBLEM - Apache HTTP on mw28 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:25:46] PROBLEM - Apache HTTP on mw44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:25:47] PROBLEM - Apache HTTP on mw46 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:25:53] PROBLEM - Apache HTTP on mw26 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:25:53] PROBLEM - Apache HTTP on mw52 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:26:20] PROBLEM - MySQL Replication Heartbeat on db60 is CRITICAL: CRIT replication delay 302 seconds
[15:26:20] PROBLEM - MySQL Replication Heartbeat on db36 is CRITICAL: CRIT replication delay 302 seconds
[15:26:21] PROBLEM - Apache HTTP on mw32 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:26:21] PROBLEM - Apache HTTP on mw18 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:26:21] PROBLEM - Apache HTTP on mw38 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:26:29] PROBLEM - Apache HTTP on mw37 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:26:29] PROBLEM - Apache HTTP on mw33 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:26:29] PROBLEM - Apache HTTP on mw36 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:26:29] PROBLEM - Apache HTTP on mw17 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:26:29] PROBLEM - Apache HTTP on mw34 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:26:56] RECOVERY - Apache HTTP on srv268 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.976 second response time
[15:26:57] whatcha doing reedy
[15:27:06] PROBLEM - Apache HTTP on mw51 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:06] PROBLEM - Apache HTTP on mw45 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:13] Currently? Nothing
[15:27:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:14] PROBLEM - Apache HTTP on mw31 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:14] PROBLEM - Apache HTTP on mw42 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:18] no, one minute ago :)
[15:27:23] PROBLEM - Apache HTTP on mw43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:23] PROBLEM - Apache HTTP on mw24 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:23] PROBLEM - Apache HTTP on mw25 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:25] I had just run sync-common-all
[15:27:32] PROBLEM - Apache HTTP on mw41 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:32] PROBLEM - Apache HTTP on mw22 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:32] PROBLEM - Apache HTTP on mw40 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:41] PROBLEM - Apache HTTP on mw57 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:41] PROBLEM - Apache HTTP on mw48 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:50] that must have upset something
[15:27:59] PROBLEM - Apache HTTP on mw54 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:59] PROBLEM - Apache HTTP on mw59 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:27:59] PROBLEM - Apache HTTP on mw53 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:28:00] seems to have come back again...
[15:28:08] PROBLEM - Apache HTTP on mw47 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:28:08] PROBLEM - Apache HTTP on mw30 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:28:08] PROBLEM - Apache HTTP on mw39 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:28:08] PROBLEM - Apache HTTP on mw19 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:28:08] PROBLEM - Apache HTTP on mw35 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:28:09] or not..
[15:28:13] you think so? :P
[15:28:36] so these were daniel's changes?
[15:28:57] can you rollback?
[15:29:04] we are down right now
[15:29:20] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:29:42] hello?
[15:30:02] They were already live last night
[15:30:05] I can't change apache config
[15:30:17] sync-common-all is just MW files
[15:30:29] so you didn't sync anything new?
[15:30:31] http://ganglia.wikimedia.org/latest/?c=Application%20servers%20pmtpa&m=load_one&r=hour&s=by%20name&hc=4&mc=2 looks odd
[15:30:32] RECOVERY - MySQL Replication Heartbeat on db63 is OK: OK replication delay seconds
[15:30:40] New patchset: Ottomata; "Using log2udp to relay udp2logs from oxygen over to Kraken on analytics1011." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27417
[15:30:41] Found it, s3's having issues.
[15:31:02] right
[15:31:08] PROBLEM - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:31:36] wth?
[15:31:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27417
[15:31:40] db63 in trouble
[15:31:47] reedy@fenari:/usr/local/apache/common-local$ sql enwiki
[15:31:47] show ERROR 1040 (08004): Too many connections
[15:31:57] [6363440.564747] Out of memory: Kill process 775 (mysqld) score 982 or sacrifice child
[15:31:58] [6363440.572464] Killed process 775 (mysqld) total-vm:109088864kB, anon-rss:96977944kB, file-rss:0kB
[15:32:29] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 47585 bytes in 0.403 seconds
[15:32:30] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27417
[15:32:40] db63 is s1 though
[15:32:47] RECOVERY - LVS HTTP IPv4 on appservers.svc.pmtpa.wmnet is OK: HTTP OK HTTP/1.1 200 OK - 59487 bytes in 8.802 seconds
[15:32:54] The rest look fine on dbtree
[15:33:09] mysqld is running
[15:33:31] it has finished crash recovery
[15:33:58] quite a large cpu spike about 10 minutes ago
[15:33:59] PROBLEM - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:34:29] but mysql has been running since yesterday
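The db63 diagnosis above — clients bouncing off "Too many connections" while dmesg shows the OOM killer taking out mysqld at roughly 97 GB of resident memory — is the classic pattern of a database host whose InnoDB buffer pool leaves no memory headroom; Asher's "reduce innodb buffer pool to 75% of total ram" change later in this log is the follow-up. A hedged sketch of the checks involved, using only stock tooling (host name taken from the discussion above):

    # Did the kernel kill mysqld? (matches the dmesg lines pasted above)
    dmesg | grep -iE 'out of memory|killed process.*mysqld'
    # Are clients actually hitting the connection ceiling?
    mysql -h db63 -e "SHOW GLOBAL VARIABLES LIKE 'max_connections'; SHOW GLOBAL STATUS LIKE 'Threads_connected';"
    # How big is the buffer pool relative to physical RAM?
    mysql -h db63 -e "SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_size';"
    free -g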
[15:35:17] New patchset: Ottomata; "Reverting Tata Zero log filter back to what it was." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27418
[15:36:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27418
[15:36:57] https://gdash.wikimedia.org/dashboards/datastores/
[15:36:59] RECOVERY - LVS HTTP IPv4 on foundation-lb.esams.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 39738 bytes in 2.463 seconds
[15:37:03] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27418
[15:40:08] RECOVERY - MySQL Slave Delay on db1017 is OK: OK replication delay 20 seconds
[15:40:26] PROBLEM - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[15:40:53] RECOVERY - MySQL Slave Delay on db60 is OK: OK replication delay 4 seconds
[15:41:02] RECOVERY - MySQL Replication Heartbeat on db59 is OK: OK replication delay 10 seconds
[15:41:02] RECOVERY - MySQL Replication Heartbeat on db1017 is OK: OK replication delay 0 seconds
[15:41:02] RECOVERY - MySQL Replication Heartbeat on db1049 is OK: OK replication delay 0 seconds
[15:41:02] RECOVERY - Apache HTTP on mw32 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 5.106 second response time
[15:41:11] RECOVERY - MySQL Slave Delay on db38 is OK: OK replication delay 0 seconds
[15:41:20] RECOVERY - MySQL Replication Heartbeat on db1033 is OK: OK replication delay 0 seconds
[15:41:20] RECOVERY - Apache HTTP on mw42 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time
[15:41:20] RECOVERY - MySQL Replication Heartbeat on db1050 is OK: OK replication delay 0 seconds
[15:41:20] RECOVERY - MySQL Slave Delay on db59 is OK: OK replication delay 0 seconds
[15:41:20] RECOVERY - Apache HTTP on mw43 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 4.491 second response time
[15:41:29] RECOVERY - MySQL Replication Heartbeat on db1001 is OK: OK replication delay 0 seconds
[15:41:29] RECOVERY - MySQL Slave Delay on db36 is OK: OK replication delay 25 seconds
[15:41:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.275 seconds
[15:41:29] RECOVERY - Apache HTTP on mw58 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.738 second response time
[15:41:29] RECOVERY - Apache HTTP on mw44 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time
[15:41:30] RECOVERY - Apache HTTP on mw51 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.036 second response time
[15:41:30] RECOVERY - Apache HTTP on mw52 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.046 second response time
[15:41:31] RECOVERY - Apache HTTP on mw29 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time
[15:41:31] RECOVERY - Apache HTTP on mw45 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 9.517 second response time
[15:41:32] RECOVERY - Apache HTTP on mw55 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.474 second response time
[15:41:32] RECOVERY - Apache HTTP on mw41 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.440 second response time
[15:41:33] RECOVERY - Apache HTTP on mw26 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.727 second response time
[15:41:33] RECOVERY - Apache HTTP on mw25 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.833 second response time
[15:41:34] RECOVERY - Apache HTTP on mw24 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.143 second response time
[15:41:34] RECOVERY - Apache HTTP on mw31 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 1.192 second response time
[15:41:35] RECOVERY - Apache HTTP on mw21 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.061 second response time
[15:41:35] RECOVERY - Apache HTTP on mw46 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 7.333 second response time
[15:41:38] RECOVERY - Apache HTTP on mw20 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.425 second response time
[15:41:38] RECOVERY - Apache HTTP on mw49 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 8.901 second response time
[15:41:38] RECOVERY - MySQL Replication Heartbeat on db1043 is OK: OK replication delay 0 seconds
[15:41:47] RECOVERY - Apache HTTP on mw57 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time
[15:41:47] RECOVERY - Apache HTTP on mw28 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.032 second response time
[15:41:47] RECOVERY - Apache HTTP on mw48 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.181 second response time
[15:41:56] RECOVERY - MySQL Replication Heartbeat on db60 is OK: OK replication delay 0 seconds
[15:41:56] RECOVERY - LVS HTTPS IPv4 on wikiversity-lb.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.1 200 OK - 47585 bytes in 0.156 seconds
[15:41:56] RECOVERY - Apache HTTP on mw22 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.269 second response time
[15:41:57] RECOVERY - MySQL Replication Heartbeat on db36 is OK: OK replication delay 0 seconds
[15:42:05] RECOVERY - MySQL Replication Heartbeat on db1042 is OK: OK replication delay 0 seconds
[15:42:14] RECOVERY - Apache HTTP on mw30 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.048 second response time
[15:42:14] RECOVERY - Apache HTTP on mw39 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.034 second response time
[15:42:14] RECOVERY - Apache HTTP on mw53 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.055 second response time
[15:42:14] RECOVERY - Apache HTTP on mw35 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time
[15:42:14] RECOVERY - Apache HTTP on mw38 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.027 second response time
[15:42:15] RECOVERY - Apache HTTP on mw34 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.043 second response time
[15:42:15] RECOVERY - Apache HTTP on mw59 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.031 second response time
[15:42:16] RECOVERY - Apache HTTP on mw33 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.029 second response time
[15:42:16] RECOVERY - Apache HTTP on mw54 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.045 second response time
[15:42:17] RECOVERY - Apache HTTP on mw18 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.044 second response time
[15:42:17] RECOVERY - Apache HTTP on mw17 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.033 second response time
[15:42:18] RECOVERY - Apache HTTP on mw19 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.047 second response time
[15:42:18] RECOVERY - Apache HTTP on mw36 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.059 second response time
[15:42:19] RECOVERY - Apache HTTP on mw37 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.381 second response time
[15:42:19] RECOVERY - Apache HTTP on mw40 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.056 second response time
[15:42:20] RECOVERY - MySQL Replication Heartbeat on db38 is OK: OK replication delay 0 seconds
[15:42:20] RECOVERY - Apache HTTP on mw47 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 2.189 second response time
[16:15:32] PROBLEM - Puppet freshness on neon is CRITICAL: Puppet has not run in the last 10 hours
[16:15:32] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[16:29:29] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.296 seconds
[16:31:11] New patchset: Alex Monk; "(bug 40911) Raise account creation throttle for Malayalam Wikipedia workshop" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27422
[16:43:43] !log reedy synchronized wmf-config/InitialiseSettings.php 'Set wgServer for fdcwiki'
[16:43:54] Logged the message, Master
[16:44:57] notpeter: about?
[16:57:41] PROBLEM - Host search32 is DOWN: PING CRITICAL - Packet loss = 100%
[16:57:53] !log reedy synchronized wmf-config/InitialiseSettings.php 'Add meta as import source for fdcwiki'
[16:58:05] Logged the message, Master
[17:00:13] paravoid: Have you setup any new swift containers yet?
[17:01:51] no
[17:01:53] RECOVERY - Host search32 is UP: PING OK - Packet loss = 0%, RTA = 0.50 ms
[17:02:00] Aaron told me at one point that MW does this automatically
[17:02:29] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:02:35] I seem to recall someone else saying this before, but I know for the last batch of new wiki creations, Ben had to add them manually
[17:02:41] AaronSchulz: ^
[17:02:54] There is http://wikitech.wikimedia.org/view/Swift/How_To#Create_a_container linked...
[17:03:44] I don't think anyone has to do anything manually
[17:05:34] I just uploaded an image to the wiki and thumbnails aren't working
[17:05:36] uploads do..
[17:05:53] https://fdc.wikimedia.org/w/thumb.php?f=FDC.png&width=800
[17:05:56] Hmm, that 404s
[17:08:10] It's not loading thumb.php
[17:08:15] Reedy: sup?
[17:08:30] notpeter: can you setup search indexes for fdc.wikimedia.org please?
[17:08:39] sure
[17:09:19] AaronSchulz: it symlinks back to the same version...
[17:10:38] Reedy: do we have any kind of checklist for new wikis?
[17:10:51] Yup
[17:10:52] http://wikitech.wikimedia.org/view/Add_a_wiki
[17:11:29] Hmm, AaronSchulz api.php isn't found either...
[17:11:46] Reedy: so, just to be sure, wiki name will be fdcwiki, yes?
[17:11:58] database name? yeah
[17:12:03] cool
[17:12:59] !log reedy synchronized docroot/fdc/
[17:13:11] Logged the message, Master
[17:13:25] Well, that fixed the api link...
[17:14:45] If someone could also delete /usr/local/apache/common/docroot/fdc/w on srv233, that'd be helpful please
[17:14:48] owned by root:root
[17:15:12] !log created indices for fdcwiki on searchidx2 and searchidx1001
[17:15:23] Logged the message, notpeter
[17:16:06] AFK for dinner
[17:16:35] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 8.755 seconds
[17:18:29] notpeter: the new precise mw1-? job runners are all happy now, right?
[17:18:43] binasher: should be
[17:18:49] they are running jobs
[17:19:08] although, if they actually hate running jobs, then they're very unhappy.
[17:19:14] I'll commune with them and let you know.
[17:19:22] notpeter: when done with the fdc stuff, can you kill off all the old ones?
[17:19:39] sure. why?
[17:19:52] also, done with that. it's all of 1 command :)
[17:20:32] still too many of them
[17:20:36] ok
[17:21:24] same thing as a couple weeks ago
[17:22:55] is less more?
[17:23:11] or is more more?
[17:23:14] :)
[17:23:50] to be fair, the jobqueue on zhwiki was up to like 3 mil, then I added a bunch more threads and it went down over two days
[17:23:57] I can't prove causality, though
[17:25:34] binasher: ^
[17:32:07] New patchset: Pyoungmeister; "removing jobrunner from all apaches but new, precise, jobrunner only boxes" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27430
[17:33:03] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27430
[17:34:14] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27430
[17:35:20] PROBLEM - check_minfraud_secondary on payments1003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:35:20] PROBLEM - check_minfraud_secondary on payments1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:35:20] PROBLEM - check_minfraud_secondary on payments4 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:35:20] PROBLEM - check_minfraud_secondary on payments1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:35:20] PROBLEM - check_minfraud_secondary on payments1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:38:29] PROBLEM - Puppet freshness on magnesium is CRITICAL: Puppet has not run in the last 10 hours
[17:38:29] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[17:40:17] RECOVERY - check_minfraud_secondary on payments4 is OK: HTTP OK: HTTP/1.1 302 Found - 120 bytes in 0.424 second response time
[17:40:17] RECOVERY - check_minfraud_secondary on payments1001 is OK: HTTP OK: HTTP/1.1 302 Found - 120 bytes in 0.299 second response time
[17:40:17] RECOVERY - check_minfraud_secondary on payments1004 is OK: HTTP OK: HTTP/1.1 302 Found - 120 bytes in 0.297 second response time
[17:40:17] RECOVERY - check_minfraud_secondary on payments1002 is OK: HTTP OK: HTTP/1.1 302 Found - 120 bytes in 0.297 second response time
[17:40:17] RECOVERY - check_minfraud_secondary on payments1003 is OK: HTTP OK: HTTP/1.1 302 Found - 120 bytes in 0.298 second response time
[17:40:20] binasher: ok, done.
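For the jobrunner thread above (and the zhwiki queue that comes up again below), queue depth can be checked per wiki from a deploy host. mwscript and showJobs.php are standard MediaWiki/WMF tooling, but treat the exact flags here as illustrative rather than a record of what was actually run:

    # Total pending jobs on zhwiki (the "up to like 3 mil" figure mentioned above):
    mwscript showJobs.php --wiki=zhwiki
    # The same count broken down by job type (e.g. the refreshLinks2 jobs discussed later):
    mwscript showJobs.php --wiki=zhwiki --group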
[17:40:46] they took our jobs!
[17:41:11] binasher: you're destroying job opportunities. I just want you to know that.
[17:45:16] did they outsource them?
[17:46:13] apergos: the wmf jobrunners are now in a more competitive environment.
[17:46:32] you cut their pay??!!
[17:46:49] no no. I fired them so that others will work for less.
[17:47:01] that's rough
[17:47:09] notpeter: maybe you could have more spin up and turn off as needed
[17:47:41] really, we need more job creators. and by that, i mean template changes.
[17:47:42] it would be a "flexible labor market" economy
[17:47:55] I have a bit of update on the source of the GET requests to ms7 with old-style math urls, but I'm a bit tired to dig any further tonight...
[17:48:26] not even a byte of update?
[17:48:31] seems that pages that haven't been edited too recently are still sitting there cached with the lang/project/math urls
[17:48:57] not all of them (maybe some people force a reload, I dunno)
[17:49:03] the caches expire after 30 days right?
[17:49:10] yes but that doesn't seem to matter
[17:49:19] what I mean is
[17:49:20] Reedy & ma rk bumped the epoch the other day
[17:49:24] for this reason among other things
[17:49:24] I see things that have been revalidated
[17:49:38] as 'didn't change since the last-modified date'
[17:49:45] and still kept
[17:50:48] I'm getting these from ams squids, I haven't tried other ones yet
[17:50:56] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[17:51:14] < Last-Modified: Tue, 17 Apr 2012 13:50:08 GMT
[17:51:24] but
[17:51:25] < Date: Sat, 22 Sep 2012 17:02:57 GMT
[17:51:27] here's the url:
[17:51:40] 'http://en.wikipedia.org/wiki/Epimorphism
[17:51:51] I do curl -v -H 'Accept-Encoding: gzip,deflate,sdch' to get it
[17:52:13] sample bad path: //upload.wikimedia.org/wikipedia/en/math/6/f/f/6ff8dcfec7743192babadf5178662a4e.png
[17:52:33] I can dig up other examples as needed, ...tomorrow :-)
[17:53:37] ah and it's amssq32 with the hit, for me
[17:53:44] for that specific page
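apergos' probe above can be repeated with a single curl, keeping only the headers that matter for the stale-math-URL question. This is a sketch of the same check, not a new finding; which squid answers and which headers it returns will vary:

    curl -sv -H 'Accept-Encoding: gzip,deflate,sdch' \
         'http://en.wikipedia.org/wiki/Epimorphism' -o /dev/null 2>&1 |
      grep -iE '^< (Date|Last-Modified|Age|X-Cache)'
    # A Last-Modified months older than Date on a cache hit (amssq32 in the
    # example above) means the object was revalidated as "unchanged" and kept,
    # which is why pages parsed before the epoch bump can still reference the
    # old /lang/project/math/ image paths despite the 30-day cache lifetime.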
[17:54:02] New patchset: Pyoungmeister; "re-adding self to nagios (not sure why I wasn't in there...)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27431
[17:54:55] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27431
[17:55:26] PROBLEM - Puppet freshness on sq69 is CRITICAL: Puppet has not run in the last 10 hours
[17:55:32] anyways looking at a few I could not of course guess why some were ok and some aren't
[17:57:39] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27431
[17:57:59] !log reedy synchronized php-1.21wmf1/cache/interwiki.cdb 'Update interwiki cache'
[17:58:11] binasher: jobqueues are increasing, fyi
[17:58:11] Logged the message, Master
[17:58:26] I could spin up the jobrunners in eqiad...
[17:58:51] !log reedy synchronized php-1.20wmf12/cache/interwiki.cdb 'Update interwiki cache'
[17:59:03] Logged the message, Master
[18:00:58] mutante: about?
[18:02:32] binasher: by turning off the lucid ones, it was roughly a 65% decrease in number of jobrunners (from about 600 to about 200)
[18:02:53] there are 16 hosts in eqiad marked for jobrunning.
[18:04:05] Can someone please delete /usr/local/apache/common/docroot/fdc/w on srv233
[18:04:42] Reedy: for you i will
[18:05:21] Reedy: done
[18:05:28] thanks! :)
[18:06:41] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.027 seconds
[18:10:57] notpeter: how many are running on each of the dedicated hosts?
[18:11:10] looks like 6?
[18:11:22] should be 12
[18:12:50] * Reedy pokes AaronSchulz
[18:13:05] Any idea about these thumb.php four oh fours?
[18:13:42] binasher: ah, no. they *should be* 12. but are only 5
[18:13:53] let me turn them up, yes?
[18:16:09] New patchset: Pyoungmeister; "increasing number of threads on jobrunner boxes from 5 to 12" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27437
[18:17:25] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27437
[18:18:28] What the hell are zhwiki doing? :/
[18:18:45] template edits
[18:19:01] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27437
[18:19:08] notpeter: release the jobs
[18:19:10] Everyone knows how much servers love edits to widely used templates :D
[18:19:29] i wonder if the zhwiki edits are from a bot
[18:19:42] template revert war?
[18:20:44] job queue on zhwiki out of control or something?
[18:20:55] back over 400,000
[18:21:16] wake me up when they get over 1,000,000 ;-)
[18:21:28] Don't go into a deep sleep...
[18:21:53] Is 20min long enough to fall asleep in?
[18:22:20] http://zh.wikipedia.org/w/index.php?namespace=10&tagfilter=&limit=500&title=Special%3A%E6%9C%80%E8%BF%91%E6%9B%B4%E6%94%B9
[18:22:23] Not that many template edits..
[18:22:56] Chrome has been sat trying to translate that page for the best part of 5min heh
[18:23:21] has anyone tried to ask Liangent what's going on?
[18:24:21] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Rest of wikipedia projects to 1.21wmf1
[18:24:33] Logged the message, Master
[18:26:37] binasher: just to be clear: did your comment mean yes to spinning up eqiad jobrunners as well?
[18:26:49] Reedy: that zh template edit list doesn't seem to sync up with the templates the zhwiki refreshLinks2 jobs are for
[18:26:57] notpeter: no
[18:27:00] lol.
[18:27:03] ok, didn't think so...
[18:28:06] And the jobs are using redirects
[18:31:24] 1 Warning: PHP Startup: apc.shm_size now uses M/G suffixes, please update your ini files in Unknown on line 0
[18:32:05] 10.0.11.16
[18:32:11] mw16
[18:32:30] Reedy: just fixed, actually
[18:32:38] <^demon> php.ini isn't on noc.wm.o anymore :(
[18:32:50] dpkg was fucked, breaking puppet runs
[18:34:10] ah
[18:34:28] anywho, should be fixed now. lemme know if that error pops up again
[18:34:40] willdo
[18:36:18] it would be nice if Job::batchInsert logged what action invoked it
[18:39:14] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:39:49] wow. load on apaches is *so much more reasonable* now
[18:43:40] binasher: I would like to increase load in pybal on mw* apaches by 50% as they're way under utilized now. sound legit?
[18:43:51] this is what I'm going by: http://ganglia.wikimedia.org/latest/?r=2hr&cs=&ce=&m=load_one&s=by+name&c=Application+servers+pmtpa&h=&host_regex=&max_graphs=0&tab=m&vn=&sh=1&z=small&hc=4
[18:44:56] notpeter: what changed ?
[18:44:57] :-)
[18:45:10] setting up APC is useful, someone said?
[18:45:38] LeslieCarr: turned off jobrunners on apaches and increased number of jobrunner threads on dedicated jobrunner boxes
[18:45:47] sweet
[18:45:57] all that load went here: http://ganglia.wikimedia.org/latest/?c=Jobrunners%20pmtpa&m=load_one&r=hour&s=by%20name&hc=4&mc=2
[18:46:01] yeah the mw boxes totally look underused right now -- good job :)
[18:46:06] !log storage3 power cycling
[18:46:15] yay
[18:46:18] Logged the message, Master
[18:46:36] much more reasonable, and less messy tangle of roles. hurray!
[18:48:00] yoyooooos
[18:48:15] i'm looking into setting up ganglia hadoop metrics reporting
[18:48:34] i haven't used much ganglia before, so I got a q or 2
[18:48:44] from this page: http://wiki.apache.org/hadoop/GangliaMetrics
[18:48:51] Replace @GANGLIA@ with the hostname or IP of the ganglia endpoint. Depending on your Ganglia configuration, the host or IP you use might be:
[18:48:51] • The value in the udp_send_channel you use in your /etc/gmond.conf (look for the line which says mcast_join=).
[18:48:51] • The gmetad server
[18:48:51] • Localhost
[18:48:59] gmond is running on the analytics nodes
[18:49:02] should I just use localhost?
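On the @GANGLIA@ question just above: the file being edited is hadoop-metrics.properties, and whether localhost, the multicast group from gmond.conf, or an aggregator host is the right endpoint depends on how udp_send_channel/udp_recv_channel are configured on the analytics nodes — so the address below is deliberately a placeholder. Class and property names follow the GangliaMetrics page linked above; the config path is an assumption:

    # Sketch: append to hadoop-metrics.properties on each Hadoop daemon host.
    cat >> /etc/hadoop/conf/hadoop-metrics.properties <<'EOF'
    # GangliaContext31 speaks the ganglia >= 3.1 wire format; 8649 is gmond's default port.
    dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
    dfs.period=10
    dfs.servers=GANGLIA_ENDPOINT:8649
    # Repeat the same three lines for the other metric sections (mapred, jvm, rpc, ...).
    EOF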
[18:51:50] PROBLEM - SSH on storage3 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[18:53:08] so APC was fucked again on the appservers?
[18:53:11] PROBLEM - MySQL Slave Delay on storage3 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:54:21] New patchset: Pyoungmeister; "setting extension distributor cron to run at hour 3, not every minute in hour 3" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27440
[18:54:41] useful that
[18:54:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.023 seconds
[18:54:54] hm?
[18:55:14] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27440
[18:55:16] notpeter: yay a thousand times yay
[18:57:25] Reedy: now i am
[18:58:18] gotcha, Leslie already took care of it i see, thx
[19:07:27] mutante: last known problem for fdcwiki is accessing thumb.php is giving a 404 :(
[19:24:41] PROBLEM - Host storage3 is DOWN: PING CRITICAL - Packet loss = 100%
[19:27:23] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:28:05] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27440
[19:31:57] Anyone any suggestions on how to work out why https://fdc.wikimedia.org/w/thumb.php is 404'ing?
[19:37:09] New patchset: Demon; "Tweak extension distributor to use new /mnt/originals/ location" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27443
[19:41:20] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 7.789 seconds
[19:47:42] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27443
[19:53:02] $wgDebugLogGroups['404'] = "udp://$wmfUdp2logDest/four-oh-four";
[19:54:23] lol
[19:55:25] notpeter: where is the apache/squid rule that maps /w/ -> live1.5/ ?
[19:55:30] * AaronSchulz hates how he can't find stuff
[19:55:42] AaronSchulz: apache-config?
[19:55:58] I guess it's an apache alias
[19:56:26] Hm.. no we use alias for / /wiki index.php
[19:58:28] yeah, I was looking there
[19:58:41] that's odd
[19:58:49] not in puppet files either
[19:59:50] AaronSchulz: ah, it is
[20:00:04] well, not in puppet
[20:00:09] but in mediawiki-config
[20:00:16] # Primary wiki redirector:
[20:00:17] Alias /wiki /usr/local/apache/common/docroot/mediawiki/w/index.php
[20:00:17] RewriteRule ^/$ /w/index.php
[20:00:18] RewriteRule ^/w/$ /w/index.php
[20:00:22] New patchset: Asher; "reduce innodb buffer pool to 75% of total ram" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27447
[20:00:29] which means /w is aliased from mediawiki-config docroot
[20:00:46] I guess we hardcode/repeat the symlink for each docroot?
[20:01:32] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27447
[20:02:22] AaronSchulz: yep, we copy docroot/skel-1.5/ for each domain, which contains it
[20:02:49] 'w@ -> /apache/common/live-1.5' that's not very maintainable :-/ I guess it makes sense though on a lower level
[20:03:26] It's very rare we have to do it
[20:03:52] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27447
[20:04:09] wouldn't it make sense to alias it from the apache conf files instead? that's also configurable per-domain but without having to repeat it for each domain
[20:04:18] .. in the file system
[20:04:26] I don't think it really makes that much of a difference
[20:05:00] it saves dozens of symlinks that all point to the same thing
[20:05:02] i'd like to do a quick sync-file for something in MobileFrontend - any reason i should hold off?
[20:05:16] And? It's not like they take up that much space
[20:05:26] awjr: feel free, I think
[20:05:38] Reedy: like Alias /wiki /usr/local/apache/common/live-1.5
[20:05:53] /w *
[20:06:12] the docroot folder is a whole 21M!!!
[20:06:17] Reedy: sure, the space is ok. but its easier to find / maintain.
[20:06:39] Submit a commit to fix it? And then find someone in ops who dares to deploy it ;)
[20:06:43] !log awjrichards synchronized php-1.21wmf1/extensions/MobileFrontend/javascripts/common/mf-application.js 'touchfile'
[20:06:44] also considering that the docroot is pointing to paths maintained outside the control of that repository
[20:06:55] Logged the message, Master
[20:14:34] !log flipped payments.wikimedia.org to the new eqiad cluster, live testing
[20:14:46] Logged the message, Master
[20:15:48] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[20:20:46] !log returning payments to pmtpa
[20:20:51] New patchset: Dzahn; "fix wrong number of columns in tables without language column, remove milestones from wikipedia table,.." [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/27448
[20:20:57] Logged the message, Master
[20:21:27] Change merged: Dzahn; [operations/debs/wikistats] (master) - https://gerrit.wikimedia.org/r/27448
[20:22:37] binasher: re what I was saying on friday; caching for donation stuff does seem to be working as intended... (so you were correct); but -- do you happen to have a moment to talk to me about how the Vary header works?
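mwalker's Vary question at the end of this exchange never gets an answer in-channel, so a short illustration: Vary lists the request headers the origin used when choosing a response, and Squid/Varnish therefore keep one cached copy per distinct combination of those header values. A hedged example of inspecting it — the URL is just any cacheable page, and the exact header set returned will differ by site and era:

    curl -sI 'http://en.wikipedia.org/wiki/Main_Page' | grep -iE '^(Vary|X-Cache)'
    #   Vary: Accept-Encoding, Cookie
    # -> the cache stores a separate object per (Accept-Encoding, Cookie)
    #    combination, which is how a session cookie bypasses the shared
    #    anonymous copy; the same mechanism governs the donation-page
    #    caching being discussed here.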
[20:28:27] New patchset: Cmcmahon; "Adding a comment for demo purposes, +2 or -2, doesn't matter" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/27449
[20:29:54] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.275 seconds
[20:43:27] !log draining esams via authdns-scenario for upgrade
[20:43:38] Logged the message, Mistress of the network gear.
[20:59:01] LeslieCarr: are you now a vampire?
[20:59:19] a bitvampire
[21:00:06] I knew it!
[21:01:24] PROBLEM - SSH on amslvs1 is CRITICAL: Server answer:
[21:02:45] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:02:55] RECOVERY - SSH on amslvs1 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[21:06:12] PROBLEM - Frontend Squid HTTP on sq43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:07:33] RECOVERY - Frontend Squid HTTP on sq43 is OK: HTTP OK HTTP/1.0 200 OK - 604 bytes in 0.002 seconds
[21:09:03] PROBLEM - LVS HTTP IPv4 on upload.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:09:24] er
[21:10:24] RECOVERY - LVS HTTP IPv4 on upload.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 606 bytes in 3.437 seconds
[21:11:22] wth is going on
[21:11:36] PROBLEM - Frontend Squid HTTP on sq85 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:11:45] I *think* all traffic going to pmtpa is overloading those squids
[21:11:46] LeslieCarr's drainage?
[21:11:57] could be, i can roll back
[21:11:58] there's a fair amount of wait on the pmtpa squids
[21:12:07] can we try that?
[21:12:12] unless mark says otherwise
[21:12:46] ah
[21:12:48] let me check
[21:13:01] might just be totally cold caches
[21:13:03] load on upload squids has doubled
[21:13:07] RECOVERY - Frontend Squid HTTP on sq85 is OK: HTTP OK HTTP/1.0 200 OK - 606 bytes in 9.140 seconds
[21:13:33] of course, if you don't mind me upgrading the switch i'll do it asap and it will be another 7-10 minutes
[21:14:30] looks stable now...
[21:14:36] PROBLEM - Host db1008 is DOWN: PING CRITICAL - Packet loss = 100%
[21:15:39] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 214 seconds
[21:17:18] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay NULL seconds
[21:19:51] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.030 seconds
[21:20:00] PROBLEM - LVS HTTP IPv4 on upload.pmtpa.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:20:18] PROBLEM - Frontend Squid HTTP on sq43 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:20:27] PROBLEM - Frontend Squid HTTP on sq44 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:20:49] !log undraining esams via authdns-scenario normal
[21:21:00] Logged the message, Mistress of the network gear.
[21:21:30] RECOVERY - LVS HTTP IPv4 on upload.pmtpa.wikimedia.org is OK: HTTP OK HTTP/1.0 200 OK - 606 bytes in 7.740 seconds
[21:21:48] RECOVERY - Frontend Squid HTTP on sq43 is OK: HTTP OK HTTP/1.0 200 OK - 606 bytes in 5.268 seconds
[21:21:57] RECOVERY - Frontend Squid HTTP on sq44 is OK: HTTP OK HTTP/1.0 200 OK - 606 bytes in 0.012 seconds
[21:29:36] is there an awareness that images seem to be taking ages to load on all our wikis? or... is this just limited to the 6th floor corner where I'm located?
[21:29:47] heh
[21:30:56] actually; scratch the above; whatever it was I was seeing appears to have sorted itself
[21:32:48] mwalker: yes
[21:32:53] mwalker: we had a short outage
[21:33:18] Ryan_Lane: yay lawnmowers :)
[21:33:29] heh. well, not lawnmowers this time :)
[21:34:00] weedwhacker?
[21:34:28] we moved traffic and had an overload
[21:34:33] then moved it back
[21:34:36] no, dns-wacker ;)
[21:52:33] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:02:27] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[22:02:27] PROBLEM - Puppet freshness on virt1003 is CRITICAL: Puppet has not run in the last 10 hours
[22:02:27] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[22:05:21] !log apt-get dist-upgrade on marmontel (blog)
[22:05:35] Logged the message, Master
[22:08:09] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.038 seconds
[22:15:39] PROBLEM - Host db1013 is DOWN: PING CRITICAL - Packet loss = 100%
[22:16:42] RECOVERY - Host db1013 is UP: PING OK - Packet loss = 0%, RTA = 26.54 ms
[22:26:35] New patchset: Jgreen; "adding db1013 as a potential replacement for db1008 (fundraising db master)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27476
[22:27:36] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27476
[22:28:10] TimStarling: Do you know who's in charge of Wikitech wiki these days? It's running 1.17, so I'm trying to get a bug resolved about upgrading it, but I'm not sure who to bother these days.
[22:28:26] nobody
[22:28:30] that's why it's running 1.17
[22:28:34] Heh.
[22:28:56] mutante tried to upgrade it earlier and failed
[22:28:58] It's hardly alone: https://meta.wikimedia.org/wiki/Services
[22:29:13] Wikimedia can't upgrade a simple MediaWiki installation? :-)
[22:29:28] He bugged me about it a few weeks ago, I forget how it broke
[22:30:09] https://bugzilla.wikimedia.org/show_bug.cgi?id=37317
[22:30:13] And now that I'm back in school and working 3 days a week, I don't really have time to upgrade a MW instance on a random-ass Linode instance
[22:30:15] I marked it "ops" today.
[22:30:22] I didn't know about mutante's work.
[22:30:30] Right.
[22:30:40] I didn't CC you on the bug for a reason. I assume you'll be rather busy.
[22:30:40] He did .... something. But I don't really know the details
[22:30:55] He'll undoubtedly see this in scrollback. :_0
[22:30:56] He asked for help but I mostly didn't want anything to do with it because I was (and am) busy
[22:30:58] Oop, :-)
[22:31:08] That's fair.
[22:31:36] 1.17 really is getting a bit old, though. If there are blockers, they should be at least noted somewhere.
[22:32:21] I was on it today and I'm like "oh right diff color changed".
[22:32:47] Hah
[22:32:54] That's not the only thing that changed since 1.17
[22:34:12] I tried to make an editnotice on that wiki today, but the MediaWiki namespace just glared at me.
[22:35:43] Brooke: we want to setup a new instance on precise and install a fresh wiki and import data. in between we were waiting for an openstack instance as opposed to linode
[22:36:02] Oooooh OK
[22:36:13] mutante: Cool. Can you comment on that bug, just so that doesn't get lost. :-)
[22:36:23] Or if there's an RT ticket, it could be cross-referenced.
[22:36:31] ok, will link them to each other
[22:36:37] Awesome, thank you.
[22:36:40] PROBLEM - Puppet freshness on cp1040 is CRITICAL: Puppet has not run in the last 10 hours
[22:36:43] np
[22:39:34] Jeff_Green: Ok, onsite now
[22:39:39] any change since i left?
[22:40:15] seems db1008 mgmt is still offline, but it does have a login prompt on console
[22:40:24] a locked up console
[22:40:33] ok, im going to remove all power to see if it doesn't resolve the drac lockup
[22:40:43] !log db1008 being cold reset (removing power cords)
[22:40:54] Logged the message, RobH
[22:41:01] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:46:06] RobH: no changes
[22:46:08] !log reboot seems to have restored system (is posting) but drac still not responding, possible bad cable, will test shortly
[22:46:19] Logged the message, RobH
[22:48:31] RECOVERY - Host db1008 is UP: PING OK - Packet loss = 0%, RTA = 26.69 ms
[22:49:04] Jeff_Green: so its up, but mgmt is still down
[22:49:08] going to investigate it now
[22:49:41] ok. are you expecting to shut it down again?
[22:50:37] now I have a new theory
[22:51:10] exploding management switch takes down server :-P
[22:51:55] Jeff_Green: i may have to yes
[22:51:58] PROBLEM - mysqld processes on db1008 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[22:52:02] i replaced the mgmt cable, and it also doesn't work
[22:52:05] RobH: k. no problem
[22:52:08] which makes me think i need to ensure the drac is seated
[22:52:52] mysql isn't configured to start on boot, I'll leave it down until you're done
[22:54:05] heh, Jeff_Green you can bring it back online
[22:54:08] its bad ports on mgmt switch
[22:54:15] move original cable to different set of ports, it works
[22:54:18] so bad mgmt switch
[22:54:18] ah HA
[22:54:27] ok cool
[22:54:28] we may have to rethink the cheap mgmt switch thing
[22:54:36] which cheap switches do we buy?
[22:54:39] i'll drop a ticket to fix the rest by replacing the mgmt switch
[22:54:40] PROBLEM - Puppet freshness on analytics1001 is CRITICAL: Puppet has not run in the last 10 hours
[22:54:43] cheapo 48 port netgear
[22:54:56] each rack level mgmt switch is uplinked to the core mgmt switch per datacenter
[22:54:56] we used to buy some less cheapo HPs that were ok
[22:54:58] which is a juniper
[22:55:12] yeah
[22:55:19] is there only one switch for this rack?
[22:55:26] but we should look at getting proper redundant power supply and higher quality mgmt racklevel switches
[22:55:29] yep
[22:55:32] you know, i've *never* seen a port go dead on a netgear?
[22:55:37] i have.
[22:55:39] quite a few.
[22:55:39] we used to lose ports on our cisco gear all the time
[22:55:46] i believe it, it's just funny how that works
[22:55:51] netgear = never
[22:55:55] i have had good luck with foundry
[22:55:56] linksys = used to crash a lot
[22:56:04] foundry = power supplies fried all the time
[22:56:11] and our juniper stuff works, its just picky as all shit on what it will work with.
[22:56:13] force10 = line cards failed
[22:56:16] yeah
[22:56:26] well, i have fixed the issue!
[22:56:27] \o/
[22:56:36] well, fixed the symptom that was blocking work
[22:56:41] YAYYY! I hereby transfer several of the beers other people owe me to you!
[22:56:46] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.037 seconds
[22:57:00] it actually feels like there may have been an 'event' on this rack
[22:57:03] power spike or something
[22:58:00] maybe, will look at graphs once they all aggregate on torrus and im not on site
[22:58:06] yeah
[22:58:12] lessee if it has any info that jumps out
[22:58:29] Jeff_Green: lemme know when mysql is back up and I will bring the pipeline back up
[22:58:42] thanks a million RobH
[22:58:46] RobH: last log was at 21:05:02
[22:58:53] i dont see anything really standing out on torrus for the power strip
[22:58:59] http://torrus.wikimedia.org/torrus/Facilities?nodeid=device//ps1-a2-eqiad
[22:59:15] heh, its late, cuz if 40 folks click that link, torrus will take a dive.
[22:59:16] a spike would be too short for a power strip to pick up though
[22:59:20] so hopefully they wont.
[22:59:50] they have logging
[22:59:53] but we of course dont have it on
[23:00:01] perhaps we should think about turning those on.
[23:00:15] oh, it stores it locally, looking now
[23:00:18] pgehres: mysql's starting but it will take a while to crash recover
[23:00:34] logs shows nothing
[23:00:38] RobH: it won't
[23:00:48] not sure how detailed it would be for volt/amp flux
[23:00:53] those things sample every X seconds or whatever
[23:01:13] oh well, switch needs replacing, i dropped a ticket to take care of tomorrow
[23:01:16] RECOVERY - mysqld processes on db1008 is OK: PROCS OK: 1 process with command name mysqld
[23:01:24] Jeff_Green: so ya good?
[23:01:28] yes--thanks robh
[23:02:10] !log msw-a2-eqiad has a bad set of ports, needs replacement per rt 3683
[23:02:17] !log db1008.mgmt returned to service
[23:02:21] Logged the message, RobH
[23:02:28] pgehres: well well well, it says it's done already
[23:02:31] Logged the message, RobH
[23:02:34] lemme just check on replication
[23:02:45] i gave db1008 the evil eye
[23:02:47] its gonna behave now.
[23:03:09] you could scrawl "poor impulse control" across the front too
[23:03:11] or "miscreant"
[23:03:25] then we'll photograph it for a banner campaign
[23:03:53] i explained that making me drive down here was breaking one of the laws of robotics
[23:04:01] so it knows if it does it again, i will old yeller it.
[23:04:21] im out!
[23:07:37] New patchset: Ryan Lane; "Initial commit of deployment module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27478
[23:08:38] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27478
[23:08:59] New patchset: Ryan Lane; "Add a role for the deployment module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27479
[23:10:02] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27479
[23:15:58] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 265 seconds
[23:20:59] New patchset: Dzahn; "planet - adding more translations per meta.wm" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27484
[23:21:58] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27484
[23:22:46] New patchset: Ryan Lane; "Add a role for the deployment module" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/27479
[23:23:40] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/27479
[23:27:58] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:34:25] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 4 seconds
[23:44:07] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.015 seconds
[23:47:52] PROBLEM - mysqld processes on db1013 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld