[00:10:02] PROBLEM - Puppet freshness on srv211 is CRITICAL: Puppet has not run in the last 10 hours
[00:19:46] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:30:25] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 2.919 seconds
[00:37:37] PROBLEM - MySQL disk space on db78 is CRITICAL: DISK CRITICAL - free space: /a 116104 MB (3% inode=99%):
[00:41:58] PROBLEM - Puppet freshness on locke is CRITICAL: Puppet has not run in the last 10 hours
[00:43:37] !log patched @requires handling out of phpunit on gallium so that it stops segfaulting and accepts Id9f2fea7
[00:43:47] Logged the message, Master
[01:13:37] RECOVERY - Puppet freshness on cp1015 is OK: puppet ran at Mon Jan 14 01:13:24 UTC 2013
[01:15:34] PROBLEM - MySQL Slave Delay on db1025 is CRITICAL: CRIT replication delay 306 seconds
[01:17:22] RECOVERY - MySQL Slave Delay on db1025 is OK: OK replication delay 10 seconds
[01:21:16] !log tstarling synchronized php-1.21wmf6/bin/ulimit5.sh
[01:21:28] Logged the message, Master
[01:21:44] !log tstarling synchronized php-1.21wmf6/includes/DefaultSettings.php
[01:21:54] Logged the message, Master
[01:22:15] !log tstarling synchronized php-1.21wmf6/includes/GlobalFunctions.php
[01:22:24] Logged the message, Master
[01:24:27] !log tstarling synchronized php-1.21wmf7/bin/ulimit5.sh
[01:24:38] Logged the message, Master
[01:24:45] !log tstarling synchronized php-1.21wmf7/includes/DefaultSettings.php
[01:24:54] Logged the message, Master
[01:25:00] !log tstarling synchronized php-1.21wmf7/includes/GlobalFunctions.php
[01:25:12] Logged the message, Master
[01:37:56] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[01:43:29] PROBLEM - MySQL disk space on db78 is CRITICAL: DISK CRITICAL - free space: /a 116591 MB (3% inode=99%):
[02:25:39] !log LocalisationUpdate completed (1.21wmf7) at Mon Jan 14 02:25:38 UTC 2013
[02:25:51] Logged the message, Master
[02:50:08] !log LocalisationUpdate completed (1.21wmf6) at Mon Jan 14 02:50:08 UTC 2013
[02:50:20] Logged the message, Master
[03:07:56] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[03:07:56] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[03:12:34] New review: Tim Starling; "We can probably just package texvc now, like we do for wikidiff2. It's been a year since anyone made..." [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/43354
[03:15:43] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:19:10] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.863 seconds
[03:22:10] RECOVERY - Puppet freshness on ms-fe1001 is OK: puppet ran at Mon Jan 14 03:21:36 UTC 2013
[03:40:47] New patchset: Tim Starling; "Add the search.wikimedia.org script to the main apache pool" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43784
[03:40:48] New patchset: Tim Starling; "Maintenance of Apple dictionary bridge script" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43785
[03:50:48] New patchset: Tim Starling; "Add search.wikimedia.org to the main apache pool" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/43787
[03:52:09] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43784
[03:52:21] Change merged: Tim Starling; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43785
[03:53:25] !log tstarling synchronized docroot/search.wikimedia.org/index.php
[03:53:35] Logged the message, Master
[03:54:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[03:55:37] New patchset: Tim Starling; "Add search.wikimedia.org to the main apache pool" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/43787
[03:56:09] Change merged: Tim Starling; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/43787
[04:05:13] !log tstarling synchronized docroot
[04:05:24] Logged the message, Master
[04:06:36] New patchset: Tim Starling; "Fix search.wikimedia.org order" [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/43788
[04:06:52] Change merged: Tim Starling; [operations/apache-config] (master) - https://gerrit.wikimedia.org/r/43788
[04:10:37] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 0.047 seconds
[04:31:01] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[04:31:02] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours
[04:31:02] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours
[04:31:02] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours
[04:31:02] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[04:37:55] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours
[07:31:55] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours
[08:08:58] PROBLEM - Host srv278 is DOWN: PING CRITICAL - Packet loss = 100%
[08:12:34] RECOVERY - Host srv278 is UP: PING OK - Packet loss = 0%, RTA = 0.24 ms
[08:16:51] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[08:18:07] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[08:18:43] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.086 second response time
[08:29:02] !log gallium: removed /mnt/jenkins-tmp tmpfs, replaced by /var/lib/jenkins/tmpfs {{bug|43825}}
[08:29:12] Logged the message, Master
[09:09:43] RECOVERY - Puppet freshness on locke is OK: puppet ran at Mon Jan 14 09:09:11 UTC 2013
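[Aside: the !log entry at 08:29 above replaces gallium's Jenkins scratch tmpfs at /mnt/jenkins-tmp with one at /var/lib/jenkins/tmpfs (bug 43825). The log does not show the actual puppet change; purely as an illustration of the mechanism, a tmpfs mount like the new one is typically declared as a Puppet mount resource along these lines — the size and options below are invented, not taken from the change:]

```puppet
# Illustrative sketch only -- not the change actually applied to gallium.
# Scratch tmpfs for Jenkins, kept under the Jenkins home directory.
mount { '/var/lib/jenkins/tmpfs':
    ensure  => mounted,
    device  => 'tmpfs',
    fstype  => 'tmpfs',
    options => 'defaults,noatime,size=512M',  # size is a made-up value
    atboot  => true,
}
```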
[09:23:41] PROBLEM - MySQL disk space on neon is CRITICAL: DISK CRITICAL - free space: / 348 MB (3% inode=56%):
[09:34:46] RECOVERY - Puppet freshness on srv211 is OK: puppet ran at Mon Jan 14 09:34:22 UTC 2013
[09:35:23] Hello everyone, my name is Mariya and I am an intern with WMF. I am trying to get editting permission for wikitech.wikimedia.com and Sumana told me I can get help in this channel.
[09:50:57] mitevam: Hi.
[09:51:13] mitevam: You'll need to find one of these users to have an account created for you: http://wikitech.wikimedia.org/view/Special:ListUsers/sysop
[09:51:39] hashar: You around?
[09:51:54] [04:35] Hello everyone, my name is Mariya and I am an intern with WMF. I am trying to get editting permission for wikitech.wikimedia.com and Sumana told me I can get help in this channel.
[09:51:58] hashar: ^
[09:52:12] thank you Susan
[09:52:12] hashar: Also, if you could make me an admin while you're over there... ;-)
[09:52:48] also aude seems to be on the list, aude are you around?
[09:53:07] A bunch of people are in here.
[09:53:15] Most are probably asleep or away, though.
[09:54:55] It's wikitech.wikimedia.org, BTW.
[09:55:14] hi mitevam
[09:56:06] sorry Susan, my mistake :)
[09:56:14] hi aude
[09:56:22] mitevam: since you are an intern, i think it's okay for me to do it
[09:56:59] thank you aude, what do you need from me?
[09:57:47] your email, if you want to pm it to me
[09:59:45] Susan: you're a hat collector.
[10:04:42] Susan: sorry back
[10:04:48] 1 hour to find out my passport :D
[10:05:00] my flat is basically a huge mess of pending paper work doh
[10:05:32] so wikitech account for mitevam is that it ?
[10:06:00] aude: congrats :-]
[10:06:13] mitevam: take extra care when editing the wikitech wiki :-D
[10:06:34] hashar i am set, aude already helped me
[10:07:11] and i will try
[10:07:19] not mess anything up :)
[10:07:25] you are making me nervous haha :)
[10:08:26] mitevam: just be careful and ask if you do not know ;-] You will be fine
[10:09:12] hashar will do :)
[10:54:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:56:22] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 9.836 seconds
[11:18:43] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 219 seconds
[11:18:52] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 223 seconds
[11:27:36] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds
[11:27:52] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds
[11:30:34] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[11:39:08] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[11:43:01] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 5.279 seconds
[12:18:25] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:23:35] New review: Hashar; "Oh consistency, sure. I have opened https://bugzilla.wikimedia.org/show_bug.cgi?id=43956 to track th..." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/39056
[12:30:52] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK HTTP/1.1 400 Bad Request - 336 bytes in 6.215 seconds
[13:08:49] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[13:08:50] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[13:20:31] !log jenkins: gating jobs now always report a message, even if tests complete after the change has been merged in {{gerrit|43407}}
[13:20:41] Logged the message, Master
[13:33:17] !log upgrade PEAR packages on gallium
[13:33:25] PROBLEM - Memcached on virt0 is CRITICAL: Connection refused
[13:33:26] Logged the message, Master
[13:34:32] !log upgrade PEAR packages on gallium {{bug|43957}}
[13:34:41] Logged the message, Master
[13:47:22] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 221 seconds
[13:47:49] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 228 seconds
[13:49:28] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[13:50:49] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds
[13:59:22] RECOVERY - Memcached on virt0 is OK: TCP OK - 0.003 second response time on port 11000
[14:31:55] PROBLEM - Puppet freshness on db1047 is CRITICAL: Puppet has not run in the last 10 hours
[14:31:56] PROBLEM - Puppet freshness on ms-fe1003 is CRITICAL: Puppet has not run in the last 10 hours
[14:31:56] PROBLEM - Puppet freshness on ms-fe1004 is CRITICAL: Puppet has not run in the last 10 hours
[14:31:56] PROBLEM - Puppet freshness on sq48 is CRITICAL: Puppet has not run in the last 10 hours
[14:31:56] PROBLEM - Puppet freshness on zinc is CRITICAL: Puppet has not run in the last 10 hours
[14:38:50] PROBLEM - Puppet freshness on gallium is CRITICAL: Puppet has not run in the last 10 hours
[15:05:23] paravoid: still on ceph? Mark said having a wikimedia puppet module would not be that much of an issue :-]
[15:38:04] RECOVERY - MySQL disk space on db78 is OK: DISK OK
[16:07:39] New patchset: Ottomata; "Adding for new wikipedia zero filters. Fixing sri lanke filter." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43834
[16:08:26] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43834
[16:19:22] New patchset: Reedy; "Add cron based poll for changes for huwiki deployment" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43837
[16:24:06] New review: Aude; "see comments" [operations/puppet] (production) C: 0; - https://gerrit.wikimedia.org/r/43837
[16:31:13] Anyone know where /usr/local/bin/sql comes from?
[16:31:18] I can't see it in puppet
[16:32:02] New patchset: Reedy; "Add cron based poll for changes for huwiki deployment" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43837
[16:32:30] Reedy: I see it in files/misc/scripts/sql
[16:32:54] misc::deployment::common_scripts class
[16:34:56] Damn $scriptpath
[16:35:03] cheers
[16:35:32] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 19.9149664286 (gt 8.0)
[16:40:19] PROBLEM - Host db1023 is DOWN: PING CRITICAL - Packet loss = 100%
[16:44:03] New patchset: Petrb; "fixed ram check in nagios" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43843
[16:45:30] New patchset: Reedy; "Add a sqldump script wrapper around mysqldump" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43844
[16:47:58] RECOVERY - Host db1023 is UP: PING OK - Packet loss = 0%, RTA = 26.53 ms
[16:48:45] !log Populated huwiki sites and site_identifier tables from testwiki tables
[16:48:55] Logged the message, Master
[16:53:51] !log powercycling db1037 to verify raid cfg
[16:54:00] Logged the message, Master
[16:55:09] !log packet loss on oxygen after adding new wikipedia zero filters. Stopping puppet to temporarily comment them out and make sure they are the problem.
[16:55:19] Logged the message, Master
[16:57:17] PROBLEM - Host db1037 is DOWN: PING CRITICAL - Packet loss = 100%
[16:59:31] RECOVERY - Host db1037 is UP: PING OK - Packet loss = 0%, RTA = 26.56 ms
[17:02:31] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is 0.0
[17:08:14] New patchset: Cmjohnson; "adding db1032 MAC to dhcpd" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43846
[17:14:13] PROBLEM - MySQL Replication Heartbeat on db1007 is CRITICAL: CRIT replication delay 187 seconds
[17:15:25] PROBLEM - MySQL Slave Delay on db1007 is CRITICAL: CRIT replication delay 214 seconds
[17:17:08] Change merged: Cmjohnson; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43846
[17:20:36] !log reedy synchronized php-1.21wmf7/includes/ChangesList.php
[17:20:47] Logged the message, Master
[17:21:26] RECOVERY - MySQL Replication Heartbeat on db1007 is OK: OK replication delay 0 seconds
[17:22:19] RECOVERY - MySQL Slave Delay on db1007 is OK: OK replication delay 0 seconds
[17:33:35] PROBLEM - Puppet freshness on vanadium is CRITICAL: Puppet has not run in the last 10 hours
[18:01:01] PROBLEM - udp2log log age for lucene on oxygen is CRITICAL: CRITICAL: log files /a/log/lucene/lucene.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days.
[18:02:40] "critical amount of time"
[18:11:11] New patchset: Hashar; "(bug 39082) documentation for realm/datacenter variance" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43849
[18:19:28] PROBLEM - Puppet freshness on ms1002 is CRITICAL: Puppet has not run in the last 10 hours
[18:21:52] RECOVERY - Host analytics1006 is UP: PING OK - Packet loss = 0%, RTA = 26.50 ms
[18:29:29] !log adding glusterfs-client glusterfs-common glusterfs-server and glusterfs-dbg for lucid and precise to our repo
[18:29:39] Logged the message, Master
[18:39:43] PROBLEM - Packetloss_Average on oxygen is CRITICAL: CRITICAL: packet_loss_average is 11.7060512698 (gt 8.0)
[18:44:11] New review: Anomie; "A prefix goes before; a suffix goes after." [operations/mediawiki-config] (master); V: 0 C: -1; - https://gerrit.wikimedia.org/r/43849
[18:52:06] New patchset: Ottomata; "Disabling new zero filters, oxygen can't handle it :(" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43855
[18:53:27] Change merged: Ottomata; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43855
[18:55:20] RECOVERY - udp2log log age for lucene on oxygen is OK: OK: all log files active
[18:56:04] New patchset: Hashar; "(bug 39082) documentation for realm/datacenter variance" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43849
[19:01:19] RECOVERY - Packetloss_Average on oxygen is OK: OK: packet_loss_average is 0.0
[19:02:20] New patchset: Dzahn; "decom knsq17 - disk fail - RT-4321" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43858
[19:02:59] mutante: decom?
[19:03:20] mutante: do you know what should be done for http://lists.wikimedia.org/pipermail/wikitech-l/2013-January/065593.html ?
[19:03:38] paravoid: yes, just talked to Rob.
[19:04:01] paravoid: either it stays or we can just re-add it if we fix the disk.. but until then its out of monitoring this way
[19:04:24] why?
[19:04:35] decom means "will never come back" to me
[19:05:40] !log reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Rest of the 'pedias to 1.21wmf7
[19:05:50] Logged the message, Master
[19:05:53] because otherwise Nagios will report it forever and Rob told me we are doing the temp. decom'ing on a regular basis
[19:06:04] but i am also fine just doing an ACK in Nagios and abandoning this
[19:06:50] i am also not sure if we ever replace disks in knsq hosts
[19:08:13] ACKNOWLEDGEMENT - Backend Squid HTTP on knsq17 is CRITICAL: Connection refused daniel_zahn broken disk - RT-4321
[19:08:23] Change abandoned: Dzahn; "(no reason)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43858
[19:09:30] !log provision sq48
[19:09:40] Logged the message, Master
[19:10:10] Nemo_bis: ugh, yeah, so somebody blacklisted our list server.. we need to ask them why and to be removed
[19:10:29] turning it into a ticket
[19:13:20] who speaks Italian?:)
[19:19:20] New patchset: Reedy; "Everything else to wmf7" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43861
[19:19:43] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43861
[19:24:47] !log reedy synchronized php-1.21wmf7/extensions/Wikibase
[19:24:56] aude: ^
[19:24:57] Logged the message, Master
[19:26:52] yay!
[19:39:05] RECOVERY - MySQL Slave Running on db1047 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[19:39:50] RECOVERY - MySQL Slave Running on db1043 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[19:40:28] binasher: hello :-) I got you some documentation for the mediawiki-config multiple datacenter support. https://gerrit.wikimedia.org/r/#/c/43849/ :D
[19:40:38] mutante: do you need something? :)
[19:40:39] anomie reviewing it :-]
[19:40:44] hashar: we're in a meeting right now
[19:41:07] Nemo_bis: hmm..maybe the best email address to ask them to be removed? but i'll try standards
[19:41:10] oh that is true, the ops meeting. Thanks for the notice Faidon
[19:42:05] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 509384 seconds
[19:42:24] mutante: I don't know what's the best one; if the standards don't work we can ask, we know someone in Telecom Italia
[19:43:08] Nemo_bis: oh, cool, i will get back to you if needed. if you want to poke us about it later.. it is now RT-4327
[19:44:11] https://meta.wikimedia.org/wiki/Wikimedia_Blog/Drafts/Ops,_2011-2013 is the current messy draft and I hope to finish it today and send to engineering@ for corrections
[19:44:47] PROBLEM - MySQL Slave Delay on db1043 is CRITICAL: CRIT replication delay 508251 seconds
[19:46:04] New patchset: Aude; "Update settings for Wikibase" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43867
[19:48:20] New patchset: Aude; "Update settings for Wikibase" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43867
[19:51:03] mutante: ok, I told Frieda, she's at Telecom tomorrow
[19:51:12] Nemo_bis: thanks!
[19:51:19] mutante: can I ask you an unrelated favour?
[19:53:32] Nemo_bis: what is it
[19:55:08] RECOVERY - SSH on sq48 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1 (protocol 2.0)
[20:01:08] RECOVERY - Puppet freshness on sq48 is OK: puppet ran at Mon Jan 14 20:00:56 UTC 2013
[20:01:17] PROBLEM - udp2log log age for lucene on oxygen is CRITICAL: CRITICAL: log files /a/log/lucene/lucene.log, have not been written in a critical amount of time. For most logs, this is 4 hours. For slow logs, this is 4 days.
[20:05:24] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43867
[20:05:56] RECOVERY - NTP on sq48 is OK: NTP OK: Offset -0.006504774094 secs
[20:07:06] !log reedy synchronized wmf-config/
[20:07:15] Logged the message, Master
[20:16:58] New review: Faidon; "Did we discuss this at an ops meeting? (I honestly can't remember)." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/42791
[20:17:14] New patchset: Aude; "fix test2wiki setting for Wikibase" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43875
[20:18:34] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43875
[20:19:28] !log reedy synchronized wmf-config/InitialiseSettings.php
[20:19:34] Logged the message, Master
[20:20:47] New patchset: Aude; "Enable WikibaseClient on Hungarian Wikipedia" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43876
[20:22:59] New patchset: Reedy; "Add a sqldump script wrapper around mysqldump" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43844
[20:24:35] notpeter, sorry it's me again:) Has Solr monitoring been resolved? http://nagios.wikimedia.org/nagios/cgi-bin/status.cgi?navbarsearch=1&host=solr2 suggests that it was still in OK state when the host was down
[20:26:30] Change merged: Reedy; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43876
[20:27:42] !log reedy synchronized wmf-config/InitialiseSettings.php 'Enable WikibaseClient on huwiki'
[20:27:51] Logged the message, Master
[20:28:42] Could I get someone from ops to merge and push https://gerrit.wikimedia.org/r/#/c/43837/2 through?
[20:28:52] Needed to go with the huwiki deployment that's just happened. Thanks
[20:33:51] !log xenon & caesium added to parsoid pybal config
[20:34:00] Logged the message, RobH
[20:34:46] MaxSem: that was resolved by me getting insanely sick and forgetting about the whole world. I'm a bit busy today, but I'll put it on my todo list
[20:35:01] cool, thanks
[20:35:31] * aude impatient to see wikidata stuff in recent changes on huwiki
[20:35:56] Change merged: ArielGlenn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43837
[20:36:50] !log reedy cleared profiling data
[20:36:50] PROBLEM - Parsoid on xenon is CRITICAL: Connection refused
[20:36:50] PROBLEM - Parsoid on caesium is CRITICAL: Connection refused
[20:36:55] hope you feel better today notpeter
[20:36:59] Logged the message, Master
[20:37:21] sumanah: I'm finally pretty not dead :)
[20:40:13] Thanks apergos
[20:40:24] yw
[20:41:12] hume run done, looks ok
[20:46:37] * aude searches for wikidata edits
[20:52:33] I guess we need a hume replacement in eqiad...
[20:52:36] you devs love hume.
[20:52:44] RECOVERY - udp2log log age for lucene on oxygen is OK: OK: all log files active
[20:55:34] RobH: that is kind of a work machine to offload fenari
[20:55:58] its the scripting host, yea.
[20:56:07] i think it can be replaced in eqiad with the smallest misc host possible.
[20:56:11] but not sure.
[20:56:14] script1001.eqiad ? :-]
[20:56:35] I can ask during our mediawiki meeting in an hour or so
[20:56:36] didn't we have such a box in eqiad already?
[20:56:44] don't remember which one, but it rings a bell
[20:56:46] im looking
[20:57:25] no other server currently uses the same puppet calls (ie: misc::maintenance::foundationwiki, etc)
[20:57:32] but maybe is allocated and just not setup
[20:57:34] * RobH is looking
[20:57:53] added the item to the agenda
[21:00:00] MaxSem: You gave me back yttrium if i recall correctly
[21:00:01] yes?
[21:00:07] RobH, yup
[21:00:08] (i may be making it the eqiad hume)
[21:00:17] cool, noted so i wont ask again, thx =]
[21:00:22] whee, it's an honor!:P
[21:12:09] hashar: looks like i assigned 'tin' as a delployment server in eqiad.
[21:12:28] which i assume folks dont use, but it can comebine with the roles needed to replace hume.
[21:13:01] might be
[21:13:06] ask Ryan_Lane about tin :)
[21:13:07] hrmm, and flourine is deployemnt hosts
[21:13:09] bleh
[21:13:17] how many deployment hosts we have in eqiad >_<
[21:13:28] Ryan_Lane: So which of the eqiad deployment hosts is your test deploy git thing?
[21:15:00] paravoid: I have server magnesium as a 'swift deployment test host'
[21:15:09] Can I reclaim this since we have dedicagted hosts now?
[21:16:01] New patchset: Mwang; "proxy to provide web access to instances that lack a public IP." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43886
[21:16:07] I don't have any use for it
[21:16:20] New patchset: Jgreen; "deprecating aluminium's community-analytics vhost" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43887
[21:16:42] AaronSchulz: do you use magnesium for tests?
[21:16:57] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43887
[21:23:09] New patchset: Pyoungmeister; "cleanup of pmtpa dbs in site.pp" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43889
[21:23:17] andrewbogott: git is working now. Class nginx-proxy was uploaded successfully.
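[Aside: change 43886 above ("proxy to provide web access to instances that lack a public IP") and the nginx-proxy class andrewbogott mentions are not shown in the log. Purely as a hypothetical sketch of the idea — none of the names, paths or templates below come from the actual change — such a proxy could be expressed in Puppet roughly as a define that maps a public hostname onto an instance's private address. Faidon's review later in the log (22:32) argues the generic nginx handling should live in its own reusable module, which is the usual way to split this up.]

```puppet
# Hypothetical sketch, not the code under review in change 43886.
# One proxied site per public hostname, templated into an nginx vhost.
define webproxy::site($backend) {
    file { "/etc/nginx/sites-enabled/${title}":
        content => template('webproxy/site.erb'),  # vhost proxying title -> backend
        require => Package['nginx'],
        notify  => Service['nginx'],
    }
}

# Example use, with invented names and addresses:
webproxy::site { 'myproject.example.wmflabs.org':
    backend => 'http://10.4.0.42:80',
}
```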
[21:23:35] paravoid / AaronSchulz: So before we ordered the swift hosts, both zinc and mangesium were allocated as swift testing
[21:23:41] andrewbogott: http://justpaste.it/1s5q
[21:23:46] i know paravoid said he didnt need, just giving the info
[21:23:59] New patchset: Asher; "keep pcache binlogs for one day" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43890
[21:24:00] AaronSchulz: so if you confirm you guys arent using them now, i can use them for other tasks.
[21:24:27] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43890
[21:24:47] oh, and copper.
[21:24:51] and owa*
[21:25:05] binasher: About?
[21:25:10] paravoid: owa?
[21:25:27] owa1, owa2, owa3
[21:25:27] outlook web access
[21:25:33] owa1-3 are being used in swift testing still?
[21:25:38] or were and can be reclaimed?
[21:25:43] they were apparently
[21:25:43] * ^demon whacks Reedy
[21:25:47] Reedy: sadly, that's what I think of, as well...
[21:25:50] according to puppet at least :)
[21:25:52] * RobH has all kinds of servers
[21:25:53] wooooo
[21:25:56] I don't use them
[21:26:09] I think I can manage fine with labs for testing and ms-[bf]e for production.
[21:26:11] well, then the only other person is maybe Aaron and thats it afaik
[21:26:21] !log rebooting pc1
[21:26:24] iirc all those were before we got backends in for you guys
[21:26:27] I'm pretty sure Aaron was using some of them
[21:26:28] Reedy: hey
[21:26:30] Logged the message, Master
[21:26:37] cool, i'll email him about them and CC ya
[21:26:54] RECOVERY - Host ms-be1007 is UP: PING OK - Packet loss = 0%, RTA = 26.57 ms
[21:26:57] not "us guys", this was all way before me :-)
[21:27:07] indeed
[21:27:19] i mean 'whatever person is stuck with mediastorage at the time'
[21:27:24] heh
[21:28:36] binasher: Hey. Seems huwiki (and dewiki, frwiki but not enwiki) have recentchanges.rc_params as a varbinary, whereas it should be a blob. huwiki has 0.17M rows, totalling 0.08G space. Can I safely just change that one over? (causing problems with the wikidata deploy) and then put it on your TODO list to fix the other wikis (will have to make a list of them)
[21:29:01] PROBLEM - Host pc1 is DOWN: PING CRITICAL - Packet loss = 100%
[21:29:14] New patchset: Aude; "Disable recent changes wikibase stuff in huwiki" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43892
[21:30:04] pooor pc1
[21:30:50] !log tearing down community-analytics vhost and cname
[21:30:52] Reedy: go for it. will de/frwiki be the only db's still needing that migration?
[21:31:01] Logged the message, Master
[21:31:20] binasher: so it seems we delivered you a "per datacenter configuration variance" product but forgot to ship the manual :/
[21:31:44] binasher: would be great to have the rc_params column for wikibase
[21:31:50] <^demon> hashar: Just file a "Write docs" bug and have it block bug 1. Has worked for years :)
[21:32:04] binasher: we are very sorry about this incident mr BinAsher. You have already received the new manual in your favorite review interface at https://gerrit.wikimedia.org/r/#/c/43849/ :-]
[21:32:09] RECOVERY - Host pc1 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[21:32:34] ^demon: well he reopened the bug that was set on highest priority. That works well too :-]
[21:32:59] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43889
[21:33:10] binasher: huwiki done, 9.92 seconds. I've not checked the rest of the wikis, but I'm guessing there is likely to be numerous that will need fixing
[21:33:43] hashar: ah hah, thank you!
[21:34:14] Reedy: deploy aftermath?
[21:34:21] Not so much aftermath
[21:34:25] Inconsistent db tables...
[21:34:47] gross
[21:34:59] Let me grab a headset
[21:35:01] uh oh
[21:35:12] what's inconsistent?
[21:35:15] binasher: both dewiki and frwiki are just over 0.52G each. Did you have a script for checking this before? (Based on teh fact we've been there, seen this, got the t-shirt)
[21:35:29] oh
[21:35:33] * apergos did the scrollback
[21:35:36] rc_params on huwiki (and dewiki and frwiki...) are varbinary(255). They should be blobs
[21:36:10] Change abandoned: Reedy; "Not needed." [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43892
[21:37:49] do I assume (a bit off topic) that the page_content_model field will get added to enwiki pretty soon?
[21:38:16] Nope
[21:38:22] Apparently we shouldn't need it
[21:38:24] oh?
[21:38:31] Commons might be the big wiki that may benefit from getting it
[21:38:35] but aswikisource (for example) has it
[21:38:48] We're adding it on new wikis (no point) as it's default in the schema
[21:38:54] I see
[21:40:06] PROBLEM - Puppet freshness on ms1004 is CRITICAL: Puppet has not run in the last 10 hours
[21:40:54] Reedy: when we used the php migrations scripts, it would use mw functions to proceed based on conditionals such as what the column currently is. now, i would just use foreachwiki sql.php to generate a quick lists of all wikis needing the change
[21:41:21] binasher: it worked :)
[21:41:25] (eltér | történet) . . New York (Q60); 21:33 . . Aude (vitalap | szerkesztései) (Nyelvközi hivatkozás törlése: yo:New York)
[21:42:12] PROBLEM - LDAP on sanger is CRITICAL: Connection refused
[21:43:06] PROBLEM - LDAPS on sanger is CRITICAL: Connection refused
[21:45:38] New review: Aude; "thank goodness! :) things are working nicely now" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43892
[21:46:01] Did something happen to the gerrit or mail server just now?
[21:46:12] I got "Server something.wmnet rejected body" error in Gerrit
[21:46:17] I works now, but still.
[21:46:22] (I don't recall the exact something)
[21:48:24] New patchset: Jgreen; "clean up users on aluminium" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43895
[21:48:51] <^demon> Krinkle: Nope, nothing up with gerrit.
[21:49:01] <^demon> Maybe sanger, see nagios a bit ago?
[21:49:08] Change merged: Jgreen; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43895
[21:56:13] binasher: Ugh. The minority are blobs
[21:56:42] some are blob, some are blob not null
[21:56:59] <^demon> Did it used to be not-blob?
[21:57:02] Nope
[21:57:43] Reedy: 805 wikis != blob :(
[21:57:49] yup
[21:59:54] `rc_params` blob,
[21:59:54] `rc_params` varbinary(255) NOT NULL DEFAULT '',
[21:59:54] `rc_params` varbinary(255) NOT NULL,
[21:59:54] `rc_params` blob NOT NULL,
[22:02:21] https://gerrit.wikimedia.org/r/gitweb?p=mediawiki/core.git;a=history;f=maintenance/archives/patch-rc_deleted.sql;h=f4bbd0f9317a186f0bab2836f65e6b22a694fe7f;hb=HEAD
[22:02:40] AaronSchulz added it March 2007
[22:02:42] + ADD rc_params blob NOT NULL default ''
[22:04:36] !log Modified rc_params on huwiki to make it a blob
[22:04:46] Logged the message, Master
[22:08:03] Change merged: Hashar; [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43849
[22:10:34] !log pullen {{gerrit|43849}} on fenari (simply changes some comments)
[22:10:44] Logged the message, Master
[22:12:37] New patchset: Dzahn; "common template dir is now ../../common, not just ../common" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43958
[22:13:08] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43958
[22:19:51] New review: Andrew Bogott; "Nice work!" [operations/puppet] (production); V: 0 C: 0; - https://gerrit.wikimedia.org/r/43886
[22:28:23] New patchset: Dzahn; "let planet.css be created in every language subdir, but from the same source (for now)" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43959
[22:29:15] New patchset: Pyoungmeister; "migrating 1 box in s1-s7 in pmtpa to coredb" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43960
[22:31:19] Change merged: Dzahn; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43959
[22:32:19] Nemo_bis: I look good in hats.
[22:32:39] New review: Faidon; "Why an "nginx-proxy" module? We need an nginx module that does the generic nginx parts -and can be r..." [operations/puppet] (production); V: 0 C: -1; - https://gerrit.wikimedia.org/r/43886
[22:34:00] Susan: prove it. I want photos.
[22:38:23] so, anyone think of a reason that sanger would think it's a .pmtpa.wmnet instead of a .wikimedia.org according to puppet ?
[22:38:43] what do you mean?
[22:38:58] root@sanger:~# hostname -f
[22:38:58] sanger.wikimedia.org
[22:38:59] RECOVERY - LDAPS on sanger is OK: TCP OK - 0.013 second response time on port 636
[22:39:11] 187.152.80.208.in-addr.arpa domain name pointer sanger.wikimedia.org.
[22:39:20] i mean it does info: Caching catalog for sanger.wikimedia.org , but the $name is .pmtpa.wmnet
[22:39:21] there you go, opendj back
[22:39:30] i had to kill it.. the java process was still using port 4444
[22:39:32] err: /Stage[main]/Role::Ldap::Server::Corp/Install_certificate[sanger.pmtpa.wmnet]/File[/etc/ssl/certs/sanger.pmtpa.wmnet.pem]: Could not evaluate: Could not retrieve information from environment production source(s) puppet:///files/ssl/sanger.pmtpa.wmnet.pem at /var/lib/git/operations/puppet/manifests/certs.pp:109
[22:39:53] RECOVERY - LDAP on sanger is OK: TCP OK - 0.002 second response time on port 389
[22:39:57] but the relevant bits in cert.pp is like "/etc/ssl/certs/${name}.pem":
[22:41:26] so it should be looking for sanger.wikimedia.org , not .pmtpa.wmnet
[22:41:27] $certificate = "$hostname.pmtpa.wmnet"
[22:41:28] install_certificate{ $certificate: }
[22:41:38] role/ldap.pp:202
[22:41:58] ah
[22:42:01] nasty
[22:42:07] damn you!!!
[22:42:13] why would someone do that ?
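[Aside: the fragments just quoted are the root cause — role/ldap.pp line 202 pins the certificate name to the private domain, so certs.pp's install_certificate goes looking for puppet:///files/ssl/sanger.pmtpa.wmnet.pem, which does not exist for a host whose fqdn is sanger.wikimedia.org. A minimal sketch of the repair, using the fqdn fact as the conversation below settles on and as change 43961 then implements; this is a reconstruction from the quoted fragments, not a verbatim excerpt:]

```puppet
# Reconstruction from the fragments quoted above, not a verbatim excerpt.
#
# Before (role/ldap.pp): name pinned to the private domain --
#   $certificate = "${hostname}.pmtpa.wmnet"
#   install_certificate { $certificate: }
# -- wrong for sanger, whose fqdn is sanger.wikimedia.org.
#
# After: derive the name from facter's fqdn instead.
$certificate = $::fqdn
install_certificate { $certificate: }

# certs.pp then builds the paths from the resource title, roughly:
#   file { "/etc/ssl/certs/${name}.pem":
#       source => "puppet:///files/ssl/${name}.pem",
#   }
```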
[22:42:37] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43960
[22:43:08] commit 45890d8ad844454f47902085bcd7167abd5bf076
[22:43:09] Author: Ryan Lane
[22:43:16] also has it really been broken since august ?
[22:43:23] yeah, i just git blamed too :)
[22:43:33] i guess it has been ...
[22:44:03] so i can do ${::fqdn} - i think
[22:44:04] it looks like this was moved from the nfs1/nfs2 site.pp stanza
[22:44:38] fqdn would work, yes
[22:45:00] !log killed java process on sanger, restarted opendj
[22:45:11] Logged the message, Master
[22:45:33] New patchset: Lcarr; "switching .pmtpa.wmnet to use fqdn for certificate naming" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43961
[22:45:36] would you double check this paravoid?
[22:45:43] New patchset: Andrew Bogott; "Fix log rotation for gluster server; also rotate brick logs." [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43962
[22:46:16] !log installing apache,mysql-common,perl,sudo package updates on sanger
[22:46:26] Logged the message, Master
[22:46:51] New review: Faidon; "Doh!" [operations/puppet] (production); V: 0 C: 1; - https://gerrit.wikimedia.org/r/43961
[22:53:43] New patchset: Pyoungmeister; "ganglia aggregator on db51" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43963
[22:53:51] New patchset: Hashar; "beta: switch all but one wiki to php-1.21wmf7" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43964
[22:54:33] Change merged: Pyoungmeister; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43963
[23:01:09] Change merged: Andrew Bogott; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43962
[23:03:17] New patchset: Asher; "pcache dbs need unique server-id's in this brave new replicated world" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43965
[23:03:38] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43965
[23:09:53] PROBLEM - Puppet freshness on ocg3 is CRITICAL: Puppet has not run in the last 10 hours
[23:09:53] PROBLEM - Puppet freshness on virt1004 is CRITICAL: Puppet has not run in the last 10 hours
[23:09:59] New patchset: Asher; "s2 master was incorrect" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43966
[23:10:30] Change merged: Asher; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43966
[23:10:38] PROBLEM - MySQL Replication Heartbeat on db1011 is CRITICAL: CRIT replication delay 248 seconds
[23:10:38] Hm.. officewiki db locked?
[23:10:47] PROBLEM - MySQL Replication Heartbeat on db33 is CRITICAL: CRIT replication delay 255 seconds
[23:10:56] PROBLEM - MySQL Replication Heartbeat on db1020 is CRITICAL: CRIT replication delay 266 seconds
[23:12:05] and back up
[23:12:26] RECOVERY - MySQL Replication Heartbeat on db1011 is OK: OK replication delay 0 seconds
[23:12:29] Krinkle: works for me
[23:12:33] ah
[23:12:44] It was locked for about 4-5 minutes.
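[Aside: Asher's parser-cache changes — "keep pcache binlogs for one day" earlier in the evening and "pcache dbs need unique server-id's" just above — are standard MySQL replication hygiene; duplicate server ids are nasty because a slave silently skips binlog events stamped with its own id, so replication can look healthy while data quietly diverges. As an illustration only, with invented values (in production both settings would be driven from my.cnf by puppet rather than set by hand):]

```sql
-- Illustration only: values are invented, and the real settings live in
-- my.cnf templated by puppet rather than being set interactively.

-- Give each pcache host a distinct server_id so slaves do not discard
-- events they wrongly believe they generated themselves.
SET GLOBAL server_id = 1011;

-- "keep pcache binlogs for one day": cap retention so throwaway
-- parser-cache history does not fill the disks.
SET GLOBAL expire_logs_days = 1;

-- Sanity checks afterwards:
SHOW VARIABLES LIKE 'server_id';
SHOW SLAVE STATUS;
```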
[23:14:14] RECOVERY - MySQL Replication Heartbeat on db33 is OK: OK replication delay 0 seconds
[23:14:23] PROBLEM - MySQL Slave Delay on db1020 is CRITICAL: CRIT replication delay 208 seconds
[23:16:02] RECOVERY - MySQL Slave Delay on db1020 is OK: OK replication delay 0 seconds
[23:16:11] RECOVERY - MySQL Replication Heartbeat on db1020 is OK: OK replication delay 0 seconds
[23:18:28] !log changed CNAME for search.wikimedia.org to point to wikimedia-lb instead of ekrem
[23:18:39] Logged the message, Master
[23:41:43] Hey, does anyone know who can bug about getting an account on wikitech.wikimedia.org?
[23:44:50] csteipp: http://wikitech.wikimedia.org/index.php?title=Special:ListUsers&group=bureaucrat or maybe also http://wikitech.wikimedia.org/index.php?title=Special:ListUsers&group=sysop
[23:45:18] pgehres: Thanks!
[23:47:42] New patchset: Ryan Lane; "Switch beta to new deployment scripts" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43968
[23:48:12] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43968
[23:49:59] Hey mutante, any chance you could create me an account on wikitech?
[23:50:56] I'll do it
[23:51:20] There's quite a lot of swift related warnings appering int he logs
[23:51:30] csteipp: what username do you want?
[23:51:45] TimStarling: CSteipp
[23:52:27] you should have an email
[23:52:57] New patchset: Ryan Lane; "Temporarily remove file requirement" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43969
[23:53:08] New review: Hashar; "JenkinsBot was not allowed to merge in changes for us. Chad fixed it :-)" [operations/mediawiki-config] (master) - https://gerrit.wikimedia.org/r/43849
[23:53:29] Change merged: Ryan Lane; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43969
[23:54:40] TimStarling: Thanks!
[23:56:27] robla, wikitech RFA is brutal!
[23:56:42] New patchset: Pyoungmeister; "migrating all pmtpa slaves to coredb-based roles" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43970
[23:57:09] Thehelpfulone: merely brutal? we must not be trying hard enough :)
[23:59:22] New patchset: Pyoungmeister; "migrating all pmtpa slaves to coredb-based roles" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/43970
[23:59:27] heh, well I've been known to be nice at RFA
[23:59:32] Thehelpfulone: that guy with the image removal request..he says he did not get your OTRS reply.. can you help?:)
[23:59:47] does he need to login?
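[Closing aside on the rc_params thread from 21:28–22:10 above: the log quotes the four column variants and the 2007 patch, but not the conversion itself. What the huwiki fix and binasher's suggested "foreachwiki sql.php" audit amount to is roughly the following; the information_schema query is an illustration, not a command taken from the log, and MySQL does not normally allow a DEFAULT on blob columns, hence none in the ALTER:]

```sql
-- Roughly what the huwiki fix and the follow-up audit amount to;
-- not the exact statements run in production.

-- Audit: report the current definition on each wiki
-- (run per wiki, e.g. via "foreachwiki sql.php"):
SELECT COLUMN_TYPE
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'recentchanges'
  AND COLUMN_NAME = 'rc_params';

-- Fix: convert a stray varbinary(255) back to the intended blob.
ALTER TABLE recentchanges
    MODIFY rc_params blob NOT NULL;
```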