[00:30:47] PROBLEM - Puppet freshness on eeden is CRITICAL: No successful Puppet run in the last 10 hours
[01:19:52] !log reedy synchronized php-1.22wmf14/includes/specials/SpecialUserlogin.php
[01:19:59] Logged the message, Master
[01:26:11] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:28:01] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0)
[01:50:44] (PS1) Jgreen: remove external X-Spam-Flag header [operations/puppet] - https://gerrit.wikimedia.org/r/80527
[01:57:46] (CR) Jgreen: [C: 2 V: 1] remove external X-Spam-Flag header [operations/puppet] - https://gerrit.wikimedia.org/r/80527 (owner: Jgreen)
[02:00:08] TimStarling: is it possible that you forgot to merge the udp2log filter change for fluorine on puppetmaster?
[02:00:59] the change hasn't updated fluorine:/etc/udp2log/mw and the Puppet log doesn't show an error, but it may not go back far enough.
[02:02:39] what change?
[02:02:54] you mean https://gerrit.wikimedia.org/r/#/c/80164/ ? because that wasn't merged
[02:03:46] Ugh. I don't know why I was convinced that it was. Probably I mixed up my browser tabs.
[02:04:18] sorry for the ping.
[02:15:37] !log LocalisationUpdate completed (1.22wmf13) at Fri Aug 23 02:15:37 UTC 2013
[02:15:40] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours
[02:15:44] Logged the message, Master
[02:47:37] !log LocalisationUpdate completed (1.22wmf14) at Fri Aug 23 02:47:36 UTC 2013
[02:47:42] Logged the message, Master
[02:59:10] !log LocalisationUpdate ResourceLoader cache refresh completed at Fri Aug 23 02:59:10 UTC 2013
[02:59:16] Logged the message, Master
[03:10:25] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[03:13:35] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[03:34:05] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours
[03:41:22] PROBLEM - Disk space on cp1060 is CRITICAL: DISK CRITICAL - free space: /srv/sda3 12436 MB (3% inode=99%): /srv/sdb3 13200 MB (4% inode=99%):
[05:10:27] RECOVERY - MySQL Slave Running on db1016 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[05:12:10] !log resolved db1016 replication sql error
[05:12:16] Logged the message, Master
[05:12:57] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 344394 seconds
[05:13:17] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 344248 seconds
[05:18:03] !log stopping db35 mysqld for mysql.* checks and mysql_upgrade
[05:18:09] Logged the message, Master
[05:19:07] RECOVERY - MySQL Slave Running on db35 is OK: OK replication
[05:24:01] !log db35 manual mysql.* check, restart, mysql_upgrade, resolve replication sql error
[05:24:07] Logged the message, Master
[05:24:37] PROBLEM - MySQL Slave Delay on db35 is CRITICAL: CRIT replication delay 10476497 seconds
[05:24:53] what exactly does manual check entail?
[05:26:13] myisamchk on mysql system tables data and index files
[05:37:44] ewww, myisam. :P
[05:38:10] to me "manual" sounds like running some queries
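For reference, a minimal sketch of what the "manual mysql.* check" logged above might look like in practice — myisamchk against the MyISAM system tables while mysqld is stopped, then mysql_upgrade and a replication restart. The datadir path and exact invocations are illustrative assumptions, not taken from the log:

    # with mysqld already stopped, as in the !log entries above
    myisamchk --check /var/lib/mysql/mysql/*.MYI   # verify mysql.* system tables (user, db, ...); path illustrative
    service mysql start
    mysql_upgrade                                  # repair/upgrade missing or incorrect system tables
    mysql -e 'START SLAVE; SHOW SLAVE STATUS\G'    # resume replication and watch Last_Error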
[05:53:37] RECOVERY - MySQL Slave Delay on db35 is OK: OK replication delay 0 seconds
[05:56:07] PROBLEM - MySQL Slave Running on db35 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Query caused different errors on master and slave. Error on maste
[06:00:07] RECOVERY - MySQL Slave Running on db35 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[06:06:07] PROBLEM - MySQL Slave Running on db35 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Query caused different errors on master and slave. Error on maste
[06:08:30] !log ran mysql_upgrade on db1001 (due to missing/incorrect system tables errors in log)
[06:08:36] Logged the message, Master
[06:12:02] RECOVERY - MySQL Slave Running on db35 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[06:21:02] PROBLEM - MySQL Slave Running on db35 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Query caused different errors on master and slave. Error on maste
[06:23:02] RECOVERY - MySQL Slave Running on db35 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[06:25:32] PROBLEM - MySQL Slave Delay on db35 is CRITICAL: CRIT replication delay 10400699 seconds
[06:35:12] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:37:44] (PS1) PleaseStand: More StartProfiler.php cleanup [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/80528
[06:39:06] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0)
[06:43:06] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:44:06] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0)
[07:02:56] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds
[07:03:35] it is not
[07:03:40] wtf
[07:04:17] PROBLEM - MySQL Slave Running on db1016 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error Cant find any matching row in the user table on query. Defau
[07:16:47] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours
[07:25:17] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay 0 seconds
[07:25:17] RECOVERY - MySQL Slave Running on db1016 is OK: OK replication
[07:27:14] !log enabled dns reverse-lookups on db1016 mysql while replication catches up, to avoid slave sql thread stopping on SET PASSWORD / GRANT statements using hostnames
[07:27:19] Logged the message, Master
[07:27:57] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 323646 seconds
[07:28:17] PROBLEM - MySQL Replication Heartbeat on db1016 is CRITICAL: CRIT replication delay 323625 seconds
[07:28:57] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds
[07:30:17] PROBLEM - MySQL Slave Running on db1016 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Query caused different errors on master and slave. Error on maste
[07:51:22] RECOVERY - MySQL Slave Running on db1016 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[07:53:52] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 324132 seconds
[08:14:48] (PS2) TTO: skwiktionary: Set site logo to local file [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/80321 (owner: Danny B.)
[08:15:46] (CR) TTO: [C: -1] "How embarrassing. Let me fix that." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/80321 (owner: Danny B.)
[08:18:14] (PS3) TTO: skwiktionary: Set site logo to local file [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/80321 (owner: Danny B.)
[08:27:42] (PS1) TTO: Add IP throttling exception for an Indian Wikipedia workshop [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/80537
[08:42:55] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:44:45] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0)
[09:31:28] (CR) Yuvipanda: "Can this either be rebased, or abandoned?" [operations/puppet] - https://gerrit.wikimedia.org/r/67055 (owner: Petrb)
[09:46:01] (PS1) Aude: enable url data type for test wikidata [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/80541
[10:12:08] PROBLEM - Ceph on ms-fe1001 is CRITICAL: Ceph HEALTH_WARN 12 pgs recovering: 13 requests are blocked 32 sec: recovery 11/910519311 degraded (0.000%): recovering 0 o/s, 0B/s, 363K key/s
[10:12:28] PROBLEM - Ceph on ms-fe1004 is CRITICAL: Ceph HEALTH_WARN 10 pgs recovering: 13 requests are blocked 32 sec: recovery 9/910519551 degraded (0.000%)
[10:12:28] PROBLEM - Ceph on ms-fe1003 is CRITICAL: Ceph HEALTH_WARN 10 pgs recovering: 13 requests are blocked 32 sec: recovery 9/910519551 degraded (0.000%)
[10:15:08] RECOVERY - Ceph on ms-fe1001 is OK: Ceph HEALTH_OK
[10:15:28] RECOVERY - Ceph on ms-fe1004 is OK: Ceph HEALTH_OK
[10:15:28] RECOVERY - Ceph on ms-fe1003 is OK: Ceph HEALTH_OK
[10:15:42] !log upgrading ceph to v0.67.2
[10:15:47] Logged the message, Master
[10:18:08] PROBLEM - Ceph on ms-fe1001 is CRITICAL: Ceph HEALTH_WARN noscrub flag(s) set
[10:18:28] PROBLEM - Ceph on ms-fe1004 is CRITICAL: Ceph HEALTH_WARN noscrub flag(s) set
[10:18:28] PROBLEM - Ceph on ms-fe1003 is CRITICAL: Ceph HEALTH_WARN noscrub flag(s) set
[10:31:38] PROBLEM - Puppet freshness on eeden is CRITICAL: No successful Puppet run in the last 10 hours
[10:52:17] (PS1) TTO: Adjust reupload-own permissions for ckbwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/80546
[10:56:49] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:57:49] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0)
[11:05:49] PROBLEM - Host mw31 is DOWN: PING CRITICAL - Packet loss = 100%
[11:07:23] RECOVERY - Host mw31 is UP: PING OK - Packet loss = 0%, RTA = 27.66 ms
[11:22:02] PROBLEM - NTP on mw31 is CRITICAL: NTP CRITICAL: Offset unknown
[11:26:02] RECOVERY - NTP on mw31 is OK: NTP OK: Offset -0.001788854599 secs
[11:42:23] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds
[11:43:14] Do we have documentation of which parts of the sign-up page on
[11:43:15] https://wikitech.wikimedia.org/w/index.php?title=Special:UserLogin&type=signup
[11:43:27] go to which LDAP fields?
[11:44:23] PROBLEM - MySQL Slave Running on db1016 is CRITICAL: CRIT replication Slave_IO_Running: Yes Slave_SQL_Running: No Last_Error: Error You cannot ALTER a log table if logging is enabled on query
[11:44:32] It looks like it's "Username" -> cn, sn. "Instance shell account name" -> uid most of the time.
[12:05:22] RECOVERY - MySQL Slave Running on db1016 is OK: OK replication Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Error:
[12:06:42] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:07:23] PROBLEM - MySQL Slave Delay on db1016 is CRITICAL: CRIT replication delay 16205 seconds
[12:07:32] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[12:16:12] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours
[12:17:12] RECOVERY - MySQL Replication Heartbeat on db1016 is OK: OK replication delay -1 seconds
[12:17:23] RECOVERY - MySQL Slave Delay on db1016 is OK: OK replication delay 0 seconds
[12:58:01] PROBLEM - LVS HTTP IPv4 on misc-web-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:58:15] oh?
[12:58:21] PROBLEM - LVS HTTPS IPv4 on misc-web-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:58:24] PROBLEM - LVS HTTPS IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:58:51] PROBLEM - LVS HTTP IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[12:59:22] RECOVERY - LVS HTTPS IPv4 on misc-web-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 250 bytes in 8.565 second response time
[12:59:25] RECOVERY - LVS HTTPS IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 248 bytes in 5.730 second response time
[12:59:41] RECOVERY - LVS HTTP IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 261 bytes in 0.001 second response time
[12:59:51] RECOVERY - LVS HTTP IPv4 on misc-web-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 262 bytes in 0.001 second response time
[13:03:56] antimony...
[13:04:15] * paravoid ignores it
[13:08:05] (CR) Akosiaris: [C: 2] Making the process of defining host backups easier [operations/puppet] - https://gerrit.wikimedia.org/r/80363 (owner: Akosiaris)
[13:26:11] RECOVERY - check_job_queue on hume is OK: JOBQUEUE OK - all job queues below 10,000
[13:26:30] (PS1) Faidon: authdns: make the Ganglia plugin more resilient [operations/puppet] - https://gerrit.wikimedia.org/r/80557
[13:29:21] PROBLEM - check_job_queue on hume is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[13:32:27] (CR) Faidon: [C: 2] authdns: make the Ganglia plugin more resilient [operations/puppet] - https://gerrit.wikimedia.org/r/80557 (owner: Faidon)
[13:34:11] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours
[14:11:04] (PS1) Faidon: authdns: restart gmond when the plugin changes [operations/puppet] - https://gerrit.wikimedia.org/r/80559
[14:12:25] (PS1) Reedy: Maintenance scripts should be run as Apache [operations/puppet] - https://gerrit.wikimedia.org/r/80560
[14:12:26] (CR) Faidon: [C: 2] authdns: restart gmond when the plugin changes [operations/puppet] - https://gerrit.wikimedia.org/r/80559 (owner: Faidon)
[14:13:41] (Abandoned) Reedy: Maintenance scripts should be run as Apache [operations/puppet] - https://gerrit.wikimedia.org/r/80560 (owner: Reedy)
[14:16:11] PROBLEM - Disk space on wtp1013 is CRITICAL: DISK CRITICAL - free space: / 73 MB (0% inode=78%):
[14:16:40] (PS6) Akosiaris: Refactoring nrpe module (round 2/??) [operations/puppet] - https://gerrit.wikimedia.org/r/79329
[14:19:12] RECOVERY - Disk space on wtp1013 is OK: DISK OK
[14:19:31] (CR) Akosiaris: [C: 2] "After talking with Faidon we decided that the best course at this point is to leave the if clause in the module despite it being rather uncl" [operations/puppet] - https://gerrit.wikimedia.org/r/79329 (owner: Akosiaris)
[14:28:04] (PS1) Jgreen: adjust spamassassin tagging for OTRS [operations/puppet] - https://gerrit.wikimedia.org/r/80561
[14:30:17] (CR) Jgreen: [C: 2 V: 1] adjust spamassassin tagging for OTRS [operations/puppet] - https://gerrit.wikimedia.org/r/80561 (owner: Jgreen)
[14:30:51] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:31:41] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0)
[14:44:40] (PS1) coren: Tool labs: Fix tcl package installation, add csh [operations/puppet] - https://gerrit.wikimedia.org/r/80562
[14:45:37] (CR) coren: [C: 2] "Trivial package tweaks." [operations/puppet] - https://gerrit.wikimedia.org/r/80562 (owner: coren)
[14:48:46] (PS1) Jgreen: always add spam headers for otrs server [operations/puppet] - https://gerrit.wikimedia.org/r/80563
[14:49:46] (CR) Jgreen: [C: 2 V: 1] always add spam headers for otrs server [operations/puppet] - https://gerrit.wikimedia.org/r/80563 (owner: Jgreen)
[15:01:23] PROBLEM - Disk space on wtp1011 is CRITICAL: DISK CRITICAL - free space: / 354 MB (3% inode=77%):
[15:06:11] PROBLEM - Disk space on wtp1006 is CRITICAL: DISK CRITICAL - free space: / 347 MB (3% inode=77%):
[15:06:21] PROBLEM - Parsoid on wtp1011 is CRITICAL: Connection refused
[15:06:33] !log taking analytics1023 offline for reinstall
[15:06:39] Logged the message, Master
[15:07:31] PROBLEM - DPKG on analytics1023 is CRITICAL: Timeout while attempting connection
[15:07:51] PROBLEM - Host analytics1023 is DOWN: PING CRITICAL - Packet loss = 100%
[15:09:21] RECOVERY - Parsoid on wtp1011 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.007 second response time
[15:09:22] RECOVERY - Disk space on wtp1011 is OK: DISK OK
[15:11:11] PROBLEM - Parsoid on wtp1006 is CRITICAL: Connection refused
[15:13:01] RECOVERY - Host analytics1023 is UP: PING OK - Packet loss = 0%, RTA = 0.35 ms
[15:13:38] (CR) Reedy: [C: 2] enable url data type for test wikidata [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/80541 (owner: Aude)
[15:13:50] (Merged) jenkins-bot: enable url data type for test wikidata [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/80541 (owner: Aude)
[15:14:50] thanks Reedy
[15:14:56] (PS2) Reedy: Add IP throttling exception for an Indian Wikipedia workshop [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/80537 (owner: TTO)
[15:15:01] (CR) Reedy: [C: 2] Add IP throttling exception for an Indian Wikipedia workshop [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/80537 (owner: TTO)
[15:15:13] (Merged) jenkins-bot: Add IP throttling exception for an Indian Wikipedia workshop [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/80537 (owner: TTO)
[15:15:21] PROBLEM - RAID on analytics1023 is CRITICAL: Connection refused by host
[15:15:21] PROBLEM - SSH on analytics1023 is CRITICAL: Connection refused
[15:15:31] PROBLEM - Disk space on analytics1023 is CRITICAL: Connection refused by host
[15:15:34] (CR) Reedy: [C: 2] Adjust reupload-own permissions for ckbwiki [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/80546 (owner: TTO)
[15:24:11] RECOVERY - Disk space on wtp1006 is OK: DISK OK
[15:24:11] RECOVERY - Parsoid on wtp1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.009 second response time
[15:25:21] RECOVERY - SSH on analytics1023 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[15:27:11] PROBLEM - NTP on analytics1023 is CRITICAL: NTP CRITICAL: No response from NTP server
[15:28:21] PROBLEM - Disk space on wtp1017 is CRITICAL: DISK CRITICAL - free space: / 298 MB (3% inode=77%):
[15:31:21] RECOVERY - Disk space on wtp1017 is OK: DISK OK
[15:33:22] PROBLEM - Disk space on wtp1018 is CRITICAL: DISK CRITICAL - free space: / 264 MB (2% inode=78%):
[15:38:20] PROBLEM - Parsoid on wtp1018 is CRITICAL: Connection refused
[15:39:20] RECOVERY - Parsoid on wtp1018 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.003 second response time
[15:39:30] RECOVERY - Disk space on wtp1018 is OK: DISK OK
[15:43:59] !log disabling puppet on analytics1024 and analytics1025 in preparation for reinstall of zookeeper nodes
[15:44:04] Logged the message, Master
[15:45:50] (PS1) Ottomata: Slight restructuring of analytics common role classes, including role::analytics::zookeeper::server on analytics102[345] [operations/puppet] - https://gerrit.wikimedia.org/r/80568
[15:53:16] (CR) Ottomata: [C: 2 V: 2] Slight restructuring of analytics common role classes, including role::analytics::zookeeper::server on analytics102[345] [operations/puppet] - https://gerrit.wikimedia.org/r/80568 (owner: Ottomata)
[16:03:19] RECOVERY - RAID on analytics1023 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[16:04:09] RECOVERY - NTP on analytics1023 is OK: NTP OK: Offset 0.02259421349 secs
[16:09:34] RECOVERY - Disk space on analytics1023 is OK: DISK OK
[16:14:38] paravoid: Good to see the OpenDNS issue fixed :)
[16:14:42] (PS1) Ottomata: Changing production zookeeper myids to match current values [operations/puppet] - https://gerrit.wikimedia.org/r/80573
[16:15:01] (CR) Ottomata: [C: 2 V: 2] Changing production zookeeper myids to match current values [operations/puppet] - https://gerrit.wikimedia.org/r/80573 (owner: Ottomata)
[16:19:00] !log reedy synchronized wmf-config/
[16:19:05] Logged the message, Master
[16:22:35] Reedy: well, I didn't do anything :)
[16:22:45] I did change the GeoIP db, but that's about it
[16:22:55] Ssshhh. Take the credit ;)
[16:23:16] I am doing something about Google Public DNS though
[16:25:45] !log reedy synchronized php-1.22wmf13/extensions/WikiEditor/
[16:25:50] Logged the message, Master
[16:27:14] !log reedy synchronized php-1.22wmf14/extensions/WikiEditor/
[16:27:19] Logged the message, Master
[16:29:10] paravoid: what's the deal with google's dns?
[16:36:16] (PS1) Faidon: RT: allow login via LDAP [operations/puppet] - https://gerrit.wikimedia.org/r/80577
[16:36:43] greg-g: they make queries from IPs in the US(?) and we always reply eqiad, even for users coming from Europe
[16:36:56] so if you live in Europe and use 8.8.8.8 you reach us via eqiad
[16:37:07] hmm
[16:37:12] there's a draft RFC that allows you to convey the requestor's IP information inside DNS
[16:37:12] I use google DNS
[16:37:27] PROBLEM - Disk space on analytics1023 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=96%):
[16:37:32] and servers reply "for *this* subnet, this is the answer"
[16:37:35] and we support this now
[16:37:46] google dns does too, but you need to explicitly ask them to whitelist your servers
[16:37:47] huh, cool
[16:37:50] which I did, today
[16:37:53] oh
[16:37:57] cuz, google.
[16:38:03] we are trust.
[16:38:09] opendns supports it too and also has a whitelist
[16:38:26] but opendns recursors in europe request from a european IP
[16:38:30] so this isn't a problem anymore
[16:38:41] http://afasterinternet.com/ has the details
[16:38:47] sbernardin: there? ready to replace the disk on ms-be1?
[16:38:51] (very sales oriented mail)
[16:38:55] weird, so, if you are a website that has more than one datacenter, you need to tell each DNS resolver about this? So, new entrants to the DNS resolving game are screwed?
[16:39:05] no
[16:39:15] oh, good
[16:39:17] well, kind of
[16:39:28] oh, then I take it back, kind of
[16:39:28] ;)
[16:39:43] if you are a website that does geodns
[16:39:57] to point users to different DCs
[16:40:15] you need to tell users to not use centralized DNS servers
[16:40:18] because they're evil
[16:40:30] because you take a protocol which was nicely made to be distributed and you centralize it
[16:40:37] giving google all power to know which websites you visit
[16:40:37] hahahahaha
[16:40:40] right, that.
[16:41:05] :)
[16:41:07] so, since we've broken decentralization, we need to tell each centralized one about it
[16:41:28] kind of
[16:41:43] every centralized one that employs a whitelist :)
[16:41:48] right
[16:41:55] we support edns-client-subnet for everyone
[16:41:55] cuz, some don't and are even more broken?
[16:42:26] everyone can make DNS requests to our servers saying "hey, I'm a resolver, and I have a client with *this* IP that asks for en.wikipedia.org. What should I answer?"
[16:42:40] and we'll make the geoip lookup for the client's IP, not the resolver's
[16:42:44] cmjohnson1: yes I am
[16:42:59] okay..give me a sec
[16:43:18] and reply "hey, for X.Y.Z.5/NN, the reply is wikipedia-lb.esams.wikimedia.org, you can cache this for this whole subnet"
[16:43:45] "but if a client from a different subnet asks, ask me again"
[16:44:26] RECOVERY - Disk space on analytics1023 is OK: DISK OK
[16:44:27] is what google/opendns do non-standard or a standard that is just annoying?
[16:44:38] big S Standard
[16:45:09] * greg-g is just curious and it doesn't matter
[16:45:16] sbernardin: can you verify the slot #
[16:45:28] (PS1) Petr Onderka: Documentation of command line parameters + other small changes [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/80579
[16:45:33] One sec
[16:47:12] (CR) Petr Onderka: [C: 2 V: 2] Documentation of command line parameters + other small changes [operations/dumps/incremental] (gsoc) - https://gerrit.wikimedia.org/r/80579 (owner: Petr Onderka)
[16:48:29] (CR) coren: "My part is still +1" [operations/puppet] - https://gerrit.wikimedia.org/r/75087 (owner: Ori.livneh)
[16:50:12] greg-g: not really. It's just that they offer a centralized service with a protocol that was designed to be decentralized. And now they (and we) have to work around the issues caused by this choice. They could just choose not to offer the service. But they tend to have a better QoS than most others. Plus they make NSA happier by allowing to spy on more traffic this way (just kidding. I have no idea if this is really done)
[16:50:28] * greg-g answers own question
[16:50:28] Are recursive DNS services who implement this now sending out truncated IP addresses to all authoritative DNS servers they communicate with? Are all authoritative DNS providers who implement this now sending back more specific responses to all recursive DNS servers?
[16:50:32] Section 10.2 of the IETF draft specifies that implementers MAY use a whitelist to determine who they send the truncated IP address to, and of course, authoritative DNS may choose who they include the edns response to. To date, all recursive DNS services we are aware of operate on a whitelist basis, enabling the option only for specific authoritative nameservers or zones. This is considered a best practice.
[16:50:47] the last question from: http://www.afasterinternet.com/howitworks.htm
[16:51:40] dunno why it's a best practice to not trust the DNS settings of the site you're asking about
[16:51:45] but, whatevs
[16:52:10] (CR) coren: [C: 2] "LGM" [operations/puppet] - https://gerrit.wikimedia.org/r/78486 (owner: Tim Landscheidt)
[16:53:15] (CR) coren: [C: 2] Tools: Enable Ganglia monitoring for Exim [operations/puppet] - https://gerrit.wikimedia.org/r/77848 (owner: Tim Landscheidt)
[16:53:25] sbernardin: where did you go? please check to see if there is amber light .. should be slot 5
[16:53:29] (CR) coren: [C: 2] Tools: Enable Ganglia monitoring for Redis [operations/puppet] - https://gerrit.wikimedia.org/r/77722 (owner: Tim Landscheidt)
[16:53:34] greg-g: it's a draft IETF standard
[16:53:50] it's been a draft for a while; they were using a draft EDNS code as well
[16:54:11] RFC 6891 was published in April which allows them to reserve a standard EDNS option code
[16:54:22] and they did, that's 8 for the client subnet extension
[16:54:31] Google DNS switched to that... the day before yesterday :)
[16:55:42] (CR) coren: [C: 2] "Is good." [operations/puppet] - https://gerrit.wikimedia.org/r/77234 (owner: Tim Landscheidt)
[16:58:14] aha
[16:59:35] source: https://groups.google.com/forum/#!topic/afasterinternet/HJD2WphubOg
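The mechanism discussed above is edns-client-subnet (ECS). As a hedged sketch of what such a query looks like from the resolver's side, using dig's +subnet option (available in ECS-capable builds, standard since BIND 9.10); the client subnet below is a placeholder, and ns0.wikimedia.org is simply one of the public authoritative servers:

    # ask the authoritative server for an answer scoped to a client's subnet
    dig +subnet=192.0.2.0/24 @ns0.wikimedia.org en.wikipedia.org A
    # an ECS-aware server echoes the option back with a scope prefix, e.g.
    #   ; CLIENT-SUBNET: 192.0.2.0/24/24
    # i.e. "this answer may be cached for that whole /24; re-ask for other subnets"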
[16:59:59] cmjohnson1: it's disk 5
[17:00:25] cmjohnson1: had to go upstairs to check
[17:00:40] okay...go ahead and swap the disk...thx
[17:01:46] PROBLEM - Host analytics1024 is DOWN: PING CRITICAL - Packet loss = 100%
[17:02:15] (PS5) coren: Tools: Manage obsolete files in /usr/local/bin [operations/puppet] - https://gerrit.wikimedia.org/r/77234 (owner: Tim Landscheidt)
[17:02:33] (CR) coren: [C: 2] "Manual merge." [operations/puppet] - https://gerrit.wikimedia.org/r/77234 (owner: Tim Landscheidt)
[17:04:16] PROBLEM - SSH on analytics1025 is CRITICAL: Connection refused
[17:04:16] PROBLEM - RAID on analytics1025 is CRITICAL: Connection refused by host
[17:04:16] PROBLEM - Disk space on analytics1025 is CRITICAL: Connection refused by host
[17:04:56] PROBLEM - DPKG on analytics1025 is CRITICAL: Connection refused by host
[17:06:56] RECOVERY - Host analytics1024 is UP: PING OK - Packet loss = 0%, RTA = 0.27 ms
[17:07:56] PROBLEM - RAID on analytics1024 is CRITICAL: Connection refused by host
[17:09:16] PROBLEM - Disk space on analytics1024 is CRITICAL: Connection refused by host
[17:09:16] PROBLEM - SSH on analytics1024 is CRITICAL: Connection refused
[17:09:26] PROBLEM - DPKG on analytics1024 is CRITICAL: Connection refused by host
[17:16:16] PROBLEM - NTP on analytics1025 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:16:56] PROBLEM - Puppet freshness on mw1126 is CRITICAL: No successful Puppet run in the last 10 hours
[17:18:29] !log disabling puppet on analytics 1006, 1008, 1009, 1021 and 1022 in preparation for reinstallation and repaving next week
[17:18:34] Logged the message, Master
[17:19:16] RECOVERY - SSH on analytics1025 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[17:19:16] RECOVERY - SSH on analytics1024 is OK: SSH OK - OpenSSH_5.9p1 Debian-5ubuntu1.1 (protocol 2.0)
[17:20:55] hey, the twemproxy page claims WMF is using it. how's it been? good? bad? ugly?
[17:21:16] PROBLEM - NTP on analytics1024 is CRITICAL: NTP CRITICAL: No response from NTP server
[17:28:17] PROBLEM - MySQL Slave Delay on db1035 is CRITICAL: CRIT replication delay 214 seconds
[17:28:17] PROBLEM - MySQL Replication Heartbeat on db1035 is CRITICAL: CRIT replication delay 217 seconds
[17:32:16] RECOVERY - MySQL Slave Delay on db1035 is OK: OK replication delay 0 seconds
[17:32:16] RECOVERY - RAID on analytics1025 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[17:32:16] RECOVERY - MySQL Replication Heartbeat on db1035 is OK: OK replication delay 0 seconds
[17:33:16] RECOVERY - RAID on analytics1024 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0
[17:38:44] hey binasher I hear you've got twemproxy sitting in front of memcache. How well is it working? I'm looking at trying to scale redis and it seems like an option.
[17:42:27] maplebed: we still just use direct connections from app servers to redis, but i'm happy with its performance with memcached. better client behavior when a server fails, and a general decrease in latency thanks to the request pipelining and working connection persistence. some of the gains though may have more to do with how lame php clients can be vs. anything else.
[17:43:15] are you using any of its sharding stuff?
[17:43:37] i.e. twitter uses twemproxy for all of the rails apps, but the newer scala apps talk to memcached and redis directly
[17:44:05] yeah, I have no doubt that any shittiness in the php client can be equaled by the shittiness of ruby clients.
[17:45:04] maplebed: i set it up with ketama consistent hashing + enabled key auto ejection when a host is down
[17:45:44] so it chooses which keys go where rather than grouping by keyspace?
[17:46:13] maplebed: the one thing that annoys me about twemproxy is that it doesn't support config reloads, but has to be restarted
[17:46:27] blech.
[17:46:50] yeah
[17:47:43] Well, I'm glad to hear it sounds like it's mostly been a positive experience.
[17:48:35] thanks for the rundown.
[17:48:43] np
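For context, a minimal nutcracker (twemproxy) pool along the lines binasher describes above — ketama distribution plus automatic ejection of failing hosts. The listen address, limits, and server list are illustrative, not WMF's actual configuration:

    memcached:
      listen: 127.0.0.1:11211
      hash: fnv1a_64
      distribution: ketama          # consistent hashing: ejecting a host remaps few keys
      auto_eject_hosts: true        # temporarily drop a server after repeated failures
      server_failure_limit: 3
      server_retry_timeout: 30000   # ms before retrying an ejected server
      timeout: 250
      servers:
        - 10.0.0.1:11211:1
        - 10.0.0.2:11211:1

As noted in the conversation, changing this file means restarting nutcracker; there is no config reload.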
[17:48:57] (CR) coren: [C: 2] "Clearly correct fix." [operations/puppet] - https://gerrit.wikimedia.org/r/77122 (owner: Hashar)
[17:49:05] RECOVERY - NTP on analytics1025 is OK: NTP OK: Offset -0.01272809505 secs
[17:49:15] RECOVERY - NTP on analytics1024 is OK: NTP OK: Offset -0.01610136032 secs
[17:50:15] PROBLEM - Packetloss_Average on analytics1008 is CRITICAL: CRITICAL: packet_loss_average is 53.8892575333 (gt 8.0)
[17:50:26] (PS3) coren: role::eqiad-proxy: file[] -> File[] [operations/puppet] - https://gerrit.wikimedia.org/r/77121 (owner: Hashar)
[17:51:18] (CR) coren: [C: 2] "Trivially correct." [operations/puppet] - https://gerrit.wikimedia.org/r/77121 (owner: Hashar)
[17:56:15] RECOVERY - Disk space on analytics1024 is OK: DISK OK
[18:00:15] RECOVERY - Packetloss_Average on analytics1008 is OK: OK: packet_loss_average is 2.2553997037
[18:03:15] RECOVERY - Disk space on analytics1025 is OK: DISK OK
[18:22:07] (PS3) Tim Landscheidt: Tools: Allow bastions to access other hosts with HBA [operations/puppet] - https://gerrit.wikimedia.org/r/77144
[18:22:30] (CR) Demon: "(1 comment)" [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/80175 (owner: Tim Starling)
[18:53:01] (PS1) Jgreen: tweak spamassassin scoring for otrs [operations/puppet] - https://gerrit.wikimedia.org/r/80588
[18:53:47] (CR) Jgreen: [C: 2 V: 1] tweak spamassassin scoring for otrs [operations/puppet] - https://gerrit.wikimedia.org/r/80588 (owner: Jgreen)
[19:00:52] PROBLEM - LVS HTTP IPv4 on misc-web-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:01:22] PROBLEM - LVS HTTPS IPv4 on misc-web-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:01:52] PROBLEM - LVS HTTP IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:02:32] PROBLEM - LVS HTTPS IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:03:32] RECOVERY - LVS HTTPS IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 248 bytes in 6.809 second response time
[19:03:42] RECOVERY - LVS HTTP IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 260 bytes in 0.001 second response time
[19:03:45] RECOVERY - LVS HTTP IPv4 on misc-web-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 259 bytes in 0.003 second response time
[19:04:04] hey dudes, I may have asked this before
[19:04:05] but
[19:04:12] what's the best way to remove icinga alerts?
[19:04:12] RECOVERY - LVS HTTPS IPv4 on misc-web-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 251 bytes in 0.005 second response time
[19:04:22] through puppet or manually?
[19:07:58] (CR) Ryan Lane: "(1 comment)" [operations/puppet] - https://gerrit.wikimedia.org/r/80489 (owner: Demon)
[19:09:34] (CR) Ryan Lane: [C: 2] Tee centralauth wfDebug()s to vanadium for Gangliafication [operations/puppet] - https://gerrit.wikimedia.org/r/80164 (owner: Ori.livneh)
[19:10:14] ottomata: we do everything through puppet
[19:12:38] right, but Ryan_Lane, i'm basically reinstalling servers that won't have the same nagios checks anymore
[19:12:41] icinga*
[19:12:57] so, I guess I need to figure out how to ensure => absent on monitor_services?
[19:13:02] these are exported resources too
[19:13:18] so i have to run puppet on the hosts and on nickel?
s/nickel/neon
[19:14:20] ottomata: what do you mean reinstalling?
[19:14:28] as in actually reinstalling?
[19:14:30] yes
[19:14:41] but also, turning off some things that we won't be using anymore
[19:14:45] like, some analytics udp2log instances
[19:14:46] well, icinga uses exported resources, so it should remove them
[19:15:38] like if the exported resources go stale, i.e. puppet hasn't run with those resources on the nodes for a while
[19:15:59] puppet will drop relevant configs on neon?
[19:16:09] it should, yes
[19:16:51] hm, ok cool, then I won't worry about it :)
[19:17:06] thanks!
[19:17:13] if that doesn't just work, it would be lame and puppet would suck
[19:17:17] of course, we know how puppet is
[19:17:24] hehe, aye
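A sketch of the explicit-removal path ottomata mentions, under Puppet's exported-resources model: mark the check absent on the host, run puppet there so the changed resource is re-exported, then run puppet on neon so the collector applies the removal. The define name follows the conversation; the resource title, parameters, and the @@ export are illustrative assumptions about the real manifests:

    # on the host whose check should go away
    @@monitor_service { 'udp2log_processes':
        ensure        => absent,   # collected on neon, where the icinga check is then removed
        description   => 'udp2log processes',
        check_command => 'nrpe_check_udp2log',
    }
    # alternatively, with puppetdb-backed storedconfigs, a host's exported
    # resources can be dropped wholesale from the puppetmaster:
    #   puppet node deactivate <hostname>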
[19:21:32] PROBLEM - SSH on pdf1 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:23:52] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:24:22] RECOVERY - SSH on pdf1 is OK: SSH OK - OpenSSH_4.7p1 Debian-8ubuntu3 (protocol 2.0)
[19:24:42] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.124 second response time
[19:45:35] <^d> Ryan_Lane: https://gerrit.wikimedia.org/r/#/c/79968/ is trivial. Working on your comments on the other patch.
[19:52:55] (PS1) Ottomata: Adding a cron job to run zkCleanup.sh daily. [operations/puppet/zookeeper] - https://gerrit.wikimedia.org/r/80654
[19:53:20] !log ns2.wikimedia.org glue record updated from 91.198.174.4 to 91.198.174.239
[19:53:26] Logged the message, RobH
[19:53:40] (PS2) Ottomata: Adding a cron job to run zkCleanup.sh daily. [operations/puppet/zookeeper] - https://gerrit.wikimedia.org/r/80654
[19:54:17] (PS3) Ottomata: Adding a cron job to run zkCleanup.sh daily. [operations/puppet/zookeeper] - https://gerrit.wikimedia.org/r/80654
[19:54:29] (CR) Ottomata: [C: 2 V: 2] Adding a cron job to run zkCleanup.sh daily. [operations/puppet/zookeeper] - https://gerrit.wikimedia.org/r/80654 (owner: Ottomata)
[19:55:28] (PS1) Ottomata: Updating zookeeper module with zkCleanup.sh cron job commit [operations/puppet] - https://gerrit.wikimedia.org/r/80655
[19:56:07] (CR) Ottomata: [C: 2 V: 2] Updating zookeeper module with zkCleanup.sh cron job commit [operations/puppet] - https://gerrit.wikimedia.org/r/80655 (owner: Ottomata)
[20:01:49] (PS1) Jgreen: exim adds Precedence: bulk for OTRS when it finds X-List-* headers [operations/puppet] - https://gerrit.wikimedia.org/r/80657
[20:15:20] (CR) Jgreen: [C: 2 V: 1] exim adds Precedence: bulk for OTRS when it finds X-List-* headers [operations/puppet] - https://gerrit.wikimedia.org/r/80657 (owner: Jgreen)
[20:17:11] (PS2) Demon: Restructure replication in preparation of moving off manganese [operations/puppet] - https://gerrit.wikimedia.org/r/80489
[20:26:17] (PS1) Dr0ptp4kt: Adding Google Webmaster Tools verification file for de-indexing Wikipedia Zero. [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/80660
[20:31:34] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds
[20:31:54] PROBLEM - Puppet freshness on eeden is CRITICAL: No successful Puppet run in the last 10 hours
[20:32:25] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds
[20:33:12] uh, help? what happened to db1047?
[20:33:21] !log taking mysqld down on db1047 (s1 analytics + eventlog data) for unplanned maintenance
[20:33:26] Logged the message, Master
[20:35:01] heh
[20:35:25] PROBLEM - mysqld processes on db1047 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld
[20:35:41] (CR) coren: [C: 2] "This is traditional googlespeak." [operations/mediawiki-config] - https://gerrit.wikimedia.org/r/80660 (owner: Dr0ptp4kt)
[20:36:30] Hm. How does one push the docroot once merged?
[20:37:49] !log Stopped eventlogging/consumer NAME=mysql-db1047 on vanadium
[20:37:51] ori-l: sorry, some super crazy long running analytics transactions backed up purging for too long, logs filled and everything crawled to a halt. i want to increase the number of purge threads as part of cleanup, but requires a full server restart. db1047 was hosed as far as keeping up with enwiki data even if the eventlog tables writes were fine. maybe eventlog should get split off
[20:37:55] Logged the message, Master
[20:38:22] RECOVERY - mysqld processes on db1047 is OK: PROCS OK: 1 process with command name mysqld
[20:38:29] binasher: kk, no worries, there's actually no critical data collection job running at the moment afaik
[20:38:32] (PS1) Jgreen: exim tags messages if it bypasses spamassassin check [operations/puppet] - https://gerrit.wikimedia.org/r/80662
[20:38:46] some ongoing ones (mobile activity, etc.), but i'll let them know
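The purge-thread change binasher describes maps to InnoDB's innodb_purge_threads setting, which in the MySQL 5.5/5.6 era could only be set at server startup — hence the full mysqld restart. A my.cnf sketch with an illustrative value:

    [mysqld]
    # more dedicated purge threads, so long-running transactions are less
    # able to back up purge until the logs fill (value illustrative)
    innodb_purge_threads = 4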
[20:39:44] (CR) Jgreen: [C: 2 V: 1] exim tags messages if it bypasses spamassassin check [operations/puppet] - https://gerrit.wikimedia.org/r/80662 (owner: Jgreen)
[20:40:41] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 301636 seconds
[20:41:41] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds
[20:42:06] Coren, sync-docroot
[20:42:19] MaxSem: I had just figured that out.
[20:43:31] !log root synchronized docroot
[20:43:34] MaxSem: couldn't find doc for it, but after reading about sync-apache I looked around for sync-docroot and there it was. :-)
[20:43:37] Logged the message, Master
[20:45:00] from module reusability talk: use ensure_packages from puppet stdlib vs. package ensure => .., to avoid duplicate definitions; it ensures they are installed only if they aren't already, but doesn't cause conflicts when combining modules that both want to pull the same stuff
[20:45:01] Coren, omg you're deploying as root? is there a chance that files owned by root appear?
[20:45:11] ensure_packages "Takes a list of packages and only installs them if they don't already exist."
[20:45:39] MaxSem: ... hum. Yes, I expect the newly added file would be. I take it I just goofed and should have used an account just for this?
[20:45:47] * Coren damns the doc.
[20:45:53] yup
[20:46:11] MaxSem: a hunt-and-chown for the file should be trivial enough though. What account should I have used?
[20:46:32] and then someone (most likely Reedy) will come to this channel asking to fix:)
[20:46:41] ori-l: btw, ryan merged your logging stuff for centralauth, I believe. are there any more pieces of that I should know about?
[20:46:43] MaxSem, an account in wikidev
[20:47:47] I shall fix myself preemptively then. :-)
[20:49:09] MaxSem: Ah, actually, no-- that will have worked without making the files owned by root. The dsh that is invoked does an explicit sudo to mwdeploy. :-)
[20:49:22] PROBLEM - MySQL Replication Heartbeat on db1047 is CRITICAL: CRIT replication delay 302164 seconds
[20:49:30] yup, but what about git pull?
[20:49:54] anyway, I don't see any files that shouldn't be owned by root
[20:50:11] so danger averted:)
[20:51:14] Heh. Because I did a noop -- I just noticed that the place sync-docroot rsyncs /from/ isn't pulled from git implicitly as I had expected the script to do. :-)
[20:51:37] greg-g: i'll ping you once i'm out of the meeting if it's ok
[20:52:04] heya paravoid, you there?
[20:52:08] quick packaging question for you
[20:52:14] !log Re-starting eventlogging/consumer NAME=mysql-db1047 on vanadium
[20:52:19] Logged the message, Master
[20:52:41] PROBLEM - MySQL Slave Delay on db1047 is CRITICAL: CRIT replication delay 302339 seconds
[20:54:05] back in 10
[20:57:53] !log marc synchronized docroot
[20:57:54] MaxSem: did it right that time. :-)
[20:57:58] Logged the message, Master
[21:00:58] ori-l: no worries
[21:02:19] How long does the negative caching last for a file newly added to docroot?
[21:02:38] (I had a miss on en before I did it right, now I can see it everywhere /else/)
[21:07:36] back
[21:11:15] Ryan_Lane: i don't think the config file change refreshed the udp2log / udp2log-mw services
[21:11:21] on fluorine, I mean
[21:12:11] "/etc/init.d/udp2log-mw restart" as root on fluorine should do it
[21:22:31] any idea why we'd have some page purge issues? Just found a page that was reverted 8 hours ago but for some reason was never purged after the edit (the diff showed it happened but the live page showed the previous edit until I manually purged it)
[21:24:19] ja so paravoid
[21:24:25] i want to build kafka off of the 0.8 branch
[21:24:26] not trunk
[21:24:37] alex originally created the debian branch from trunk
[21:24:43] i'm not sure the best way to proceed
[21:24:52] should I just branch 0.8 and cherry pick the debian commits?
[21:25:07] i tried merging, but get a bunch of conflicts that have nothing to do with debian/ stuff
[21:25:22] PROBLEM - LVS HTTPS IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 325 bytes in 0.017 second response time
[21:26:52] PROBLEM - LVS HTTP IPv4 on misc-web-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:27:09] should I create a new branch and just cherry pick what I want from debian branch?
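One plausible way to do what ottomata proposes: start a fresh packaging branch from upstream's 0.8 branch and carry over only the debian/ commits, sidestepping the unrelated merge conflicts. Branch names and SHAs below are placeholders:

    git fetch origin
    git checkout -b debian-0.8 origin/0.8
    # list the packaging commits on the old trunk-based debian branch, oldest first
    git log --reverse --oneline origin/debian -- debian/
    git cherry-pick <sha1> <sha2> ...    # apply each debian/ commit onto 0.8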
[21:27:22] PROBLEM - LVS HTTPS IPv4 on misc-web-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:27:52] PROBLEM - LVS HTTP IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[21:28:22] Jamesofur: i remember it being reported a few days ago, but can't find a bug
[21:28:38] :-/
[21:29:13] RECOVERY - LVS HTTPS IPv4 on misc-web-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 250 bytes in 3.377 second response time
[21:29:24] it's one thing for someone to get annoyed when something was left unreverted for 8 hours lol, another thing for it to have been reverted within a minute but never cleared lol
[21:29:28] RECOVERY - LVS HTTPS IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 249 bytes in 0.093 second response time
[21:29:42] RECOVERY - LVS HTTP IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 260 bytes in 0.008 second response time
[21:29:45] RECOVERY - LVS HTTP IPv4 on misc-web-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 261 bytes in 0.008 second response time
[21:30:33] hrm, what url is being used to monitor misc-web
[21:30:47] the little misc varnish cluster looks fine and completely unloaded
[21:30:51] * preilly kicks binasher
[21:31:05] wtf
[21:31:42] binasher: okay I take back my kick
[21:32:29] Does anybody want to go to the Giants game September 7th and get VIP access to the Dugout Club?
[21:32:47] !restarted udp2log-mw on fluorine for ori-l
[21:32:59] http://www.sfgiantsseatingchart.com/san-francisco-giants-lexus-dugout-club/#axzz2cpXm1SV8
[21:33:03] ottomata: thanks!
[21:33:31] monitoring misc-web via a git.wikimedia.org url probably isn't a good idea given gitblit stability
[21:34:01] Jamesofur: wanna report one? :P
[21:35:27] if it hasn't been yes, definitely, though of course now that I've purged it I imagine people will say they can't track it down :P
[21:35:38] and when there is a BLP issue I'm always going to purge rather than leave it lol
[21:35:43] ori-l: yw
[21:35:52] trying to see if I can find another example I'm able to leave up
[21:38:10] ori-l: restarted
[21:38:16] ah, crap
[21:38:21] ottomata already had
[21:38:52] ah well :)
[21:39:36] thanks anyhow :)
[21:40:41] greg-g: https://dpaste.de/FzIwk/raw/ shows the log message templates that are streaming to vanadium now, can you identify a short list of which ones you want to convert into a metric?
[21:42:04] PROBLEM - Disk space on wtp1007 is CRITICAL: DISK CRITICAL - free space: / 346 MB (3% inode=77%):
[21:42:32] robla: ^^
[21:42:40] ori-l: I *think* just 1463-"authentication for '$this->mName' succeeded" );
[21:43:15] there might be interesting things to learn from the other ones, but I don't know enough about the steps pre/post some of those to infer much
[21:46:45] ori-l: and rob's not sure what else would be useful (just talked IRL)
[21:47:25] PROBLEM - Parsoid on wtp1007 is CRITICAL: Connection refused
[21:54:38] * robla reads backlog now
[21:55:25] PROBLEM - MySQL Slave Delay on db55 is CRITICAL: CRIT replication delay 190 seconds
[21:55:44] PROBLEM - MySQL Replication Heartbeat on db55 is CRITICAL: CRIT replication delay 195 seconds
[21:59:26] ori-l: where are you? :P
[22:05:25] RECOVERY - MySQL Slave Delay on db55 is OK: OK replication delay 17 seconds
[22:05:44] RECOVERY - MySQL Replication Heartbeat on db55 is OK: OK replication delay 23 seconds
[22:11:25] greg-g: sixth floor, interminable meeting
[22:11:57] RECOVERY - Disk space on wtp1007 is OK: DISK OK
[22:12:27] RECOVERY - Parsoid on wtp1007 is OK: HTTP OK: HTTP/1.1 200 OK - 1373 bytes in 0.014 second response time
[22:13:12] ori-l: fine. basic question is: can we figure out from logs if a person is able to successfully view the page they are redirected to after log in? Login is great, but then being redirected back to where you are and seeing something is better ;)
[22:15:27] PROBLEM - LVS HTTPS IPv4 on misc-web-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:15:37] PROBLEM - LVS HTTPS IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:15:57] PROBLEM - LVS HTTP IPv4 on misc-web-lb.eqiad.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:16:00] PROBLEM - LVS HTTP IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[22:16:39] what's with misc?
[22:16:47] RECOVERY - LVS HTTP IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 259 bytes in 0.001 second response time
[22:16:49] RECOVERY - LVS HTTP IPv4 on misc-web-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 260 bytes in 0.001 second response time
[22:16:57] PROBLEM - Puppet freshness on cp1063 is CRITICAL: No successful Puppet run in the last 10 hours
[22:17:17] RECOVERY - LVS HTTPS IPv4 on misc-web-lb.eqiad.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 251 bytes in 0.005 second response time
[22:17:27] RECOVERY - LVS HTTPS IPv6 on misc-web-lb.eqiad.wikimedia.org_ipv6 is OK: HTTP OK: HTTP/1.1 200 OK - 249 bytes in 0.013 second response time
[22:33:30] (PS1) Asher: for misc-web-lb, just monitor varnish, not services behind it. no paging for single host services without redundancy. [operations/puppet] - https://gerrit.wikimedia.org/r/80679
[22:41:17] (CR) Asher: [C: 2 V: 2] for misc-web-lb, just monitor varnish, not services behind it. no paging for single host services without redundancy. [operations/puppet] - https://gerrit.wikimedia.org/r/80679 (owner: Asher)
[23:21:27] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:24:17] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[23:34:57] PROBLEM - Puppet freshness on fenari is CRITICAL: No successful Puppet run in the last 10 hours
[23:52:20] PROBLEM - Puppetmaster HTTPS on stafford is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[23:53:12] RECOVERY - Puppetmaster HTTPS on stafford is OK: HTTP OK: Status line output matched 400 - 336 bytes in 0.125 second response time
[23:57:07] !log removing mailing list Wikipedia-ensino, creating list educacao-br
[23:57:12] Logged the message, Master