[00:58:38] RECOVERY - MySQL Replication Heartbeat on db1047 is OK: OK replication delay 0 seconds
[00:58:56] RECOVERY - MySQL Slave Delay on db1047 is OK: OK replication delay 0 seconds
[01:16:29] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours
[01:39:36] PROBLEM - Disk space on stafford is CRITICAL: DISK CRITICAL - free space: /var/lib/puppet 758 MB (3% inode=92%):
[01:55:03] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[03:47:09] https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#is_it_me_or_is_wiki_very_slow.3F
[03:47:14] Is someone looking into that?
[04:29:05] RECOVERY - Disk space on stafford is OK: DISK OK
[04:31:23] Joan: I concur with their results
[04:32:07] That's helpful.
[07:36:30] PROBLEM - Disk space on search1021 is CRITICAL: DISK CRITICAL - free space: /a 4045 MB (3% inode=99%):
[07:36:39] PROBLEM - Disk space on search1022 is CRITICAL: DISK CRITICAL - free space: /a 4048 MB (3% inode=99%):
[08:00:39] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[08:02:45] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[08:12:39] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[08:12:39] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[08:55:20] New patchset: Hashar; "abstract logic getting irc filename, add tests" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3527
[08:55:35] New patchset: Hashar; "reindent / align hookconfig.py $filename hash" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3526
[08:55:49] New patchset: Hashar; "remove +x bits from files of /srv/org/mediawiki/integration" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3433
[08:56:04] New patchset: Hashar; "gerrit IRC bot now join #wikimedia-dev" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3530
[08:56:18] New patchset: Hashar; "use wildcards for gerrit IRC notifications" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3529
[08:56:32] New patchset: Hashar; "support project wildcard for irc notifications" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3528
[08:56:45] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3527
[08:56:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3526
[08:56:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3433
[08:56:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3530
[08:56:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3529
[08:56:46] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3528
[08:57:25] New review: Hashar; "(no comment)" [operations/puppet] (production) C: 1; - https://gerrit.wikimedia.org/r/3512
[11:18:06] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours
[11:22:36] PROBLEM - Backend Squid HTTP on sq34 is CRITICAL: Connection refused
[11:57:06] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours
[12:43:35] !log Rack mounted and powered up cr2-knams
[12:43:39] Logged the message, Master
[12:44:11] !log Disabled br1-knams:e1/2 (DF leg 1 to esams)
[12:44:14] Logged the message, Master
[12:48:00] !log Moved fiber from br1-knams:e1/2 to cr2-knams:xe-0/0/0
[12:48:04] Logged the message, Master
[12:48:26] PROBLEM - Backend Squid HTTP on knsq19 is CRITICAL: No route to host
[12:48:35] PROBLEM - Frontend Squid HTTP on knsq25 is CRITICAL: No route to host
[12:50:32] RECOVERY - Backend Squid HTTP on knsq19 is OK: HTTP OK HTTP/1.0 200 OK - 634 bytes in 0.218 seconds
[12:50:41] RECOVERY - Frontend Squid HTTP on knsq25 is OK: HTTP OK HTTP/1.0 200 OK - 27735 bytes in 0.440 seconds
[13:14:52] !log Established full iBGP mesh with added router cr2-knams. cr2-knams now has full Internet connectivity.
[13:14:56] Logged the message, Master
[13:22:10] New patchset: Mark Bergsma; "Add new router cr2-knams to Torrus monitoring" [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3686
[13:22:24] New review: gerrit2; "Lint check passed." [operations/puppet] (production); V: 1 - https://gerrit.wikimedia.org/r/3686
[13:22:34] New review: Mark Bergsma; "(no comment)" [operations/puppet] (production); V: 0 C: 2; - https://gerrit.wikimedia.org/r/3686
[13:22:37] Change merged: Mark Bergsma; [operations/puppet] (production) - https://gerrit.wikimedia.org/r/3686
[13:57:52] !log Shutdown AS1299 BGP session on br1-knams
[13:57:55] Logged the message, Master
[14:18:48] !log Brought up AS13030 and AS1299 BGP sessions on cr2-knams
[14:18:52] Logged the message, Master
[14:49:11] !log Moved AS6908 and AS1257 PIs to cr2-knams
[14:49:14] Logged the message, Master
[15:06:38] Any thoughts on https://bugzilla.wikimedia.org/show_bug.cgi?id=35448 ?
[15:06:44] !log Disabled BFD on OSPF3 between cr2-knams and csw1-esams
[15:06:48] Logged the message, Master
[15:07:28] I assume response times of bits, geoiplookup, etc. are already being measured?
[15:07:35] Is there a way to see if there's been an increase in response time?
[15:15:10] PROBLEM - Router interfaces on mr1-eqiad is CRITICAL: CRITICAL: No response from remote host 10.65.0.1 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2
[15:17:16] RECOVERY - Router interfaces on mr1-eqiad is OK: OK: host 10.65.0.1, interfaces up: 32, down: 0, dormant: 0, excluded: 0, unused: 0
[15:22:19] !log Shutdown all AMS-IX BGP sessions
[15:22:22] Logged the message, Master
[15:25:03] !log Moved AMS-IX connection to cr2-knams:xe-1/1/0
[15:25:07] Logged the message, Master
[15:30:40] !log Brought up AMS-IX ipv6 BGP sessions
[15:30:44] Logged the message, Master
[15:40:39] !log Brought up AMS-IX ipv4 BGP sessions
[15:40:42] Logged the message, Master
[16:01:53] !log Cleared OSPF session between csw1-esams and csw2-esams which magically made some internal routes reappear
[16:01:57] Logged the message, Master
[16:02:46] PROBLEM - MySQL Replication Heartbeat on db52 is CRITICAL: CRIT replication delay 320 seconds
[16:03:13] PROBLEM - MySQL Slave Delay on db52 is CRITICAL: CRIT replication delay 346 seconds
[16:35:24] sup with pages?
[16:37:58] anybody looking at commons yet?
[16:38:03] yeah, but nothing's sticking out
[16:38:11] it's fine I think
[16:38:13] it's actually up and working?
[16:38:15] I had a mistake in the new router config
[16:38:17] in esams
[16:39:34] yeah, dbs look all green and I'm able to use commons
[16:39:48] there's recovery
[16:39:50] I'm off!
[16:39:52] ttfn
[16:39:59] sorry guys
[16:40:08] no prob
[16:40:11] we forgive you
[16:40:25] payback for when I brought up those vips the other day ;)
[16:40:28] cya.
[16:40:31] hehehe
[16:40:35] c'ya
[16:40:41] bye
[17:02:12] PROBLEM - Host br1-knams is DOWN: PING CRITICAL - Packet loss = 100%
[17:07:00] PROBLEM - Apache HTTP on srv278 is CRITICAL: Connection refused
[17:09:33] !log Migrated second knams-esams dark fiber link from br1-knams to cr2-knams
[17:09:36] Logged the message, Master
[17:10:45] PROBLEM - BGP status on csw2-esams is CRITICAL: CRITICAL: host 91.198.174.244, sessions up: 3, down: 2, shutdown: 0; Peering with AS64600 not established; Peering with AS43821 not established - WIKIMEDIA-EU
[17:21:15] RECOVERY - BGP status on csw2-esams is OK: OK: host 91.198.174.244, sessions up: 4, down: 0, shutdown: 0
[17:28:00] RECOVERY - Apache HTTP on srv278 is OK: HTTP OK - HTTP/1.1 301 Moved Permanently - 0.042 second response time
[17:35:47] !log Migration from br1-knams to cr2-knams completed.
[17:35:51] Logged the message, Master
[17:49:35] ok...
[17:49:38] i'm gonna head out of here
[18:02:12] PROBLEM - Puppet freshness on owa3 is CRITICAL: Puppet has not run in the last 10 hours
[18:04:44] PROBLEM - Puppet freshness on amslvs2 is CRITICAL: Puppet has not run in the last 10 hours
[18:07:51] congrats on the migration
[18:13:44] PROBLEM - Puppet freshness on owa1 is CRITICAL: Puppet has not run in the last 10 hours
[18:13:44] PROBLEM - Puppet freshness on owa2 is CRITICAL: Puppet has not run in the last 10 hours
[19:44:35] PROBLEM - Disk space on srv220 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=61%): /var/lib/ureadahead/debugfs 0 MB (0% inode=61%):
[19:44:35] PROBLEM - Disk space on srv222 is CRITICAL: DISK CRITICAL - free space: / 195 MB (2% inode=61%): /var/lib/ureadahead/debugfs 195 MB (2% inode=61%):
[19:44:35] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 278 MB (3% inode=61%): /var/lib/ureadahead/debugfs 278 MB (3% inode=61%):
[19:48:47] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 211 MB (2% inode=61%): /var/lib/ureadahead/debugfs 211 MB (2% inode=61%):
[19:48:47] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 49 MB (0% inode=61%): /var/lib/ureadahead/debugfs 49 MB (0% inode=61%):
[19:50:53] RECOVERY - Disk space on srv222 is OK: DISK OK
[19:55:14] RECOVERY - Disk space on srv220 is OK: DISK OK
[19:57:02] RECOVERY - Disk space on srv221 is OK: DISK OK
[20:01:14] RECOVERY - Disk space on srv224 is OK: DISK OK
[20:01:23] RECOVERY - Disk space on srv219 is OK: DISK OK
[20:43:23] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 248 MB (3% inode=61%): /var/lib/ureadahead/debugfs 248 MB (3% inode=61%):
[20:57:50] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 92 MB (1% inode=61%): /var/lib/ureadahead/debugfs 92 MB (1% inode=61%):
[20:59:56] PROBLEM - Disk space on srv219 is CRITICAL: DISK CRITICAL - free space: / 148 MB (2% inode=61%): /var/lib/ureadahead/debugfs 148 MB (2% inode=61%):
[21:04:08] PROBLEM - Disk space on srv223 is CRITICAL: DISK CRITICAL - free space: / 192 MB (2% inode=61%): /var/lib/ureadahead/debugfs 192 MB (2% inode=61%):
[21:04:17] PROBLEM - Disk space on srv224 is CRITICAL: DISK CRITICAL - free space: / 210 MB (2% inode=61%): /var/lib/ureadahead/debugfs 210 MB (2% inode=61%):
[21:04:17] PROBLEM - Disk space on srv221 is CRITICAL: DISK CRITICAL - free space: / 277 MB (3% inode=61%): /var/lib/ureadahead/debugfs 277 MB (3% inode=61%):
[21:06:23] RECOVERY - Disk space on srv224 is OK: DISK OK
[21:10:26] RECOVERY - Disk space on srv219 is OK: DISK OK
[21:10:26] RECOVERY - Disk space on srv223 is OK: DISK OK
[21:10:35] RECOVERY - Disk space on srv221 is OK: DISK OK
[21:19:26] PROBLEM - Puppet freshness on db59 is CRITICAL: Puppet has not run in the last 10 hours
[21:58:26] PROBLEM - Puppet freshness on amslvs4 is CRITICAL: Puppet has not run in the last 10 hours