[00:18:12] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:20:03] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.022 second response time
[00:38:43] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:39:10] hmm
[00:39:26] see I can hit it and it works
[00:39:39] so I'm not entirely sure what icinga-wm is seeing
[00:40:43] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 5.057 second response time
[00:52:28] (CR) Mholloway: [C: -1] "Putting a -1 on this for now, probably needs discussion." [puppet] - https://gerrit.wikimedia.org/r/293887 (owner: Mholloway)
[00:54:53] PROBLEM - dhclient process on rutherfordium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:55:13] PROBLEM - configured eth on rutherfordium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:55:14] PROBLEM - puppet last run on rutherfordium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:55:32] PROBLEM - salt-minion processes on rutherfordium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:55:33] PROBLEM - Check size of conntrack table on rutherfordium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:55:45] PROBLEM - DPKG on rutherfordium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:55:53] PROBLEM - Disk space on rutherfordium is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[00:56:22] PROBLEM - HTTP-peopleweb on rutherfordium is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[00:58:02] ganeti vm lookup issue, people.wm.org, looking
[00:59:43] RECOVERY - DPKG on rutherfordium is OK: All packages OK
[00:59:54] RECOVERY - Disk space on rutherfordium is OK: DISK OK
[01:00:23] RECOVERY - HTTP-peopleweb on rutherfordium is OK: HTTP OK: HTTP/1.1 200 OK - 1699 bytes in 0.006 second response time
[01:00:36] ok, just connected to console, came back, the type where i just connected but don't even run commands
[01:00:41] leaves again
[01:01:03] RECOVERY - dhclient process on rutherfordium is OK: PROCS OK: 0 processes with command name dhclient
[01:01:22] RECOVERY - configured eth on rutherfordium is OK: OK - interfaces up
[01:01:32] RECOVERY - puppet last run on rutherfordium is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[01:01:36] !log rutherfordium ganeti lockup, gnt-instance console .. and it recovered
[01:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[01:01:42] RECOVERY - salt-minion processes on rutherfordium is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[01:01:43] RECOVERY - Check size of conntrack table on rutherfordium is OK: OK: nf_conntrack is 0 % full
[01:29:43] PROBLEM - Disk space on lithium is CRITICAL: DISK CRITICAL - free space: /srv/syslog 11380 MB (3% inode=99%)
[01:50:42] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[01:52:32] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.017 second response time
[02:07:03] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 398 bytes in 0.004 second response time
[02:09:03] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 2.291 second response time
[02:29:52] !log mwdeploy@tin scap sync-l10n completed (1.28.0-wmf.5) (duration: 11m 42s)
[02:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:32:03] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: puppet fail
[02:36:22] !log l10nupdate@tin ResourceLoader cache refresh completed at Sat Jun 11 02:36:22 UTC 2016 (duration 6m 30s)
[02:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:42:11] PROBLEM - puppet last run on mw2190 is CRITICAL: CRITICAL: puppet fail
[02:49:30] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures
[02:50:09] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[02:51:59] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 0.021 second response time
[02:56:31] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[02:56:35] Operations, DBA, Labs: disk failure on labsdb1002 - https://phabricator.wikimedia.org/T126946#2373304 (zhuyifei1999) >>! In T126946#2371339, @Blahma wrote: > A confirmed work around is to remove categories from the concerned articles, save and add immediately revert, but I have been questioned by a p...
[02:58:17] (CR) MZMcBride: Avoid breaking full phabricator URLs (1 comment) [puppet] - https://gerrit.wikimedia.org/r/256663 (https://phabricator.wikimedia.org/T75997) (owner: Thiemo Mättig (WMDE))
[03:09:33] RECOVERY - puppet last run on mw2190 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[03:11:52] !log ori@tin Synchronized php-1.28.0-wmf.5/includes/parser/CacheTime.php: ad-hoc logging of updateCacheExpiry(0) traces (duration: 00m 25s)
[03:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:14:13] !log ori@tin Synchronized php-1.28.0-wmf.5/includes/parser/CacheTime.php: remove ad-hoc logging of updateCacheExpiry(0) traces (duration: 00m 23s)
[03:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:16:43] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[03:34:50] (am hacking on labmon / graphite.wmflabs.org)
[03:36:52] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: Connection refused
[03:38:29] I think the underlying problem...
[03:38:36] is just apache2 being used there for no good reason
[03:41:01] no good reason in wmflabs, that is
[03:41:02] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1590 bytes in 8.204 second response time
[03:41:07] it is used for ldap auth in prod
[03:41:19] and uses the outdated mod_uwsgi rather than mod_proxy_uwsgi
[03:43:03] PROBLEM - puppet last run on labmon1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[03:51:03] RECOVERY - puppet last run on labmon1001 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[04:01:17] alright, now it's running with mod_proxy
[04:01:23] let's see how long that lasts and if it still keeps flapping
[04:01:32] apache is spawning way too many things
[04:07:33] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:15:42] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 9.359 second response time
[04:16:33] PROBLEM - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - string OK not found on http://checker.tools.wmflabs.org:80/dumps - 288 bytes in 0.019 second response time
[04:27:53] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[04:29:53] well fuck
[04:30:13] I've no idea wtf is going on with labmon
[04:32:03] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 3.325 second response time
[04:40:12] I set harakiri to 10s now
[04:40:16] let's see if that makes it more solid
[04:48:13] 8mins with no flap!!
[04:49:52] PROBLEM - puppet last run on mw2188 is CRITICAL: CRITICAL: Puppet has 1 failures
[05:05:37] 25 minutes with no flaps!
[05:07:44] Operations, ops-eqiad: install/setup new labmon1001 system - https://phabricator.wikimedia.org/T136972#2373388 (yuvipanda) Open>Resolved This was done
[05:09:30] Operations, ops-eqiad, Labs, Labs-Infrastructure: Reinstall labmon1001 with new disk configuration (and jessie) - https://phabricator.wikimedia.org/T136227#2373391 (yuvipanda) Open>declined The machine died on the table, never quite came back up from a restart. We did T136972 instead
[05:09:59] Operations, ops-eqiad: install/setup new labmon1001 system - https://phabricator.wikimedia.org/T136972#2354048 (yuvipanda)
[05:16:10] RECOVERY - puppet last run on mw2188 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures
[05:30:42] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[05:33:47] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 2.976 second response time
[06:13:13] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[06:14:55] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.020 second response time
[06:29:13] Operations, Discovery, Maps, Epic: Epic: cultivating the Maps garden - https://phabricator.wikimedia.org/T137616#2373414 (Gehel)
[06:29:58] Operations, Discovery, Maps: Tune thread for osm2pgsql / postgres max connections for Maps - https://phabricator.wikimedia.org/T137229#2373430 (Gehel)
[06:30:00] Operations, Discovery, Maps, Epic: Epic: cultivating the Maps garden - https://phabricator.wikimedia.org/T137616#2373428 (Gehel)
[06:30:10] Operations, Discovery, Maps, Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2241419 (Gehel)
[06:30:12] Operations, Discovery, Maps: Tune thread for osm2pgsql / postgres max connections for Maps - https://phabricator.wikimedia.org/T137229#2361550 (Gehel)
[06:30:34] PROBLEM - puppet last run on scb1001 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:34] PROBLEM - puppet last run on mw2145 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:30:54] PROBLEM - puppet last run on db1028 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:04] PROBLEM - puppet last run on mw1135 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:14] PROBLEM - puppet last run on einsteinium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:33] PROBLEM - puppet last run on cp1053 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:53] PROBLEM - puppet last run on restbase1007 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:53] PROBLEM - puppet last run on druid1002 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:14] PROBLEM - puppet last run on mw2126 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:33] PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:34] PROBLEM - puppet last run on neodymium is CRITICAL: CRITICAL: Puppet has 1 failures
[06:34:53] PROBLEM - Disk space on logstash1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=97%)
[06:42:13] Operations, Discovery, Maps, Epic: Enable specs on Katotherian service - https://phabricator.wikimedia.org/T137617#2373434 (Gehel)
[06:46:38] Operations, Discovery, Maps: Send logs to logstash for maps services (katotherian, tilerator, tileratorui) - https://phabricator.wikimedia.org/T137618#2373448 (Gehel)
[06:46:52] Operations, Discovery, Maps: Enable specs on Katotherian service - https://phabricator.wikimedia.org/T137617#2373462 (Gehel)
[06:56:04] RECOVERY - puppet last run on restbase1007 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures
[06:56:44] RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:53] RECOVERY - puppet last run on neodymium is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[06:56:53] RECOVERY - puppet last run on scb1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:54] RECOVERY - puppet last run on mw2145 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:14] RECOVERY - puppet last run on db1028 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:24] RECOVERY - puppet last run on mw1135 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:33] RECOVERY - puppet last run on einsteinium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:44] RECOVERY - puppet last run on cp1053 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:49] Operations, Discovery, Maps: Ensure that Kartotherian on new maps200? servers are sending metrics to Graphite - https://phabricator.wikimedia.org/T137619#2373481 (Gehel)
[06:58:05] RECOVERY - puppet last run on druid1002 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:33] RECOVERY - puppet last run on mw2126 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:05:39] Operations, Discovery, Maps, Discovery-Maps-Sprint, Patch-For-Review: Configure monitoring / alerting of Postgresql / redis / ... cluster for maps - https://phabricator.wikimedia.org/T135647#2373499 (Gehel)
[07:05:41] Operations, Discovery, Maps, Epic: Epic: cultivating the Maps garden - https://phabricator.wikimedia.org/T137616#2373498 (Gehel)
[07:05:57] Operations, Discovery, Maps, Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2373500 (Gehel)
[07:05:59] Operations, Discovery, Maps, Discovery-Maps-Sprint, Patch-For-Review: Configure monitoring / alerting of Postgresql / redis / ... cluster for maps - https://phabricator.wikimedia.org/T135647#2305596 (Gehel)
[07:07:36] Operations, Discovery, Maps, Discovery-Maps-Sprint, Patch-For-Review: Configure monitoring / alerting of Postgresql / redis / ... cluster for maps - https://phabricator.wikimedia.org/T135647#2305596 (Gehel) Postgresql and redis are only used for tile generation, not user facing operations. As...
[07:11:12] RECOVERY - Disk space on lithium is OK: DISK OK
[07:11:24] Operations, Discovery, Maps, Traffic: Send traffic to new maps200? servers - https://phabricator.wikimedia.org/T137620#2373503 (Gehel)
[07:20:42] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:22:33] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.016 second response time
[07:33:25] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:36:55] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.009 second response time
[07:45:14] PROBLEM - HHVM rendering on mw1155 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[07:46:13] PROBLEM - Apache HTTP on mw1155 is CRITICAL: Connection timed out
[07:46:35] PROBLEM - puppet last run on logstash1001 is CRITICAL: CRITICAL: puppet fail
[07:47:04] PROBLEM - Host mw1155 is DOWN: PING CRITICAL - Packet loss = 100%
[07:54:54] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Proxy Error - 557 bytes in 0.010 second response time
[07:58:54] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.009 second response time
[08:07:23] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:09:14] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 7.784 second response time
[08:15:14] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:19:05] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.030 second response time
[08:39:34] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[08:41:33] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.580 second response time
[09:18:51] PROBLEM - Start and verify pages via webservices on checker.tools.wmflabs.org is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 SERVICE UNAVAILABLE - 187 bytes in 10.900 second response time
[09:20:50] RECOVERY - Start and verify pages via webservices on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 6.576 second response time
[09:52:17] (CR) Hashar: [C: 1] "Looks entirely harmless to me." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/291173 (owner: BryanDavis)
[09:58:32] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Proxy Error - 557 bytes in 0.446 second response time
[10:03:38] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 6.482 second response time
[10:10:59] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Proxy Error - 557 bytes in 0.008 second response time
[10:16:59] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.600 second response time
[10:23:09] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[10:26:59] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.019 second response time
[10:29:17] (PS2) Microchip08: Redirect phabricator.mediawiki.org [puppet] - https://gerrit.wikimedia.org/r/293464 (https://phabricator.wikimedia.org/T137252)
[10:39:19] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Proxy Error - 557 bytes in 0.503 second response time
[10:43:19] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.018 second response time
[10:49:48] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 2 failures
[10:52:19] PROBLEM - changeprop endpoints health on scb1002 is CRITICAL: Generic error: Generic connection error: HTTPConnectionPool(host=10.64.16.21, port=7272): Max retries exceeded with url: /?spec (Caused by ProtocolError(Connection aborted., error(111, Connection refused)))
[10:58:00] that's me ^
[10:58:20] RECOVERY - changeprop endpoints health on scb1002 is OK: All endpoints are healthy
[11:10:35] RECOVERY - Make sure enwiki dumps are not empty on checker.tools.wmflabs.org is OK: HTTP OK: HTTP/1.1 200 OK - 166 bytes in 0.078 second response time
[11:15:15] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Proxy Error - 557 bytes in 0.001 second response time
[11:15:44] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[11:19:15] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.011 second response time
[11:41:35] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2373740 (Danny_B)
[11:42:02] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2361418 (Danny_B) There's much more link types than those listed above. I started the spider to find them. Then I'l...
[11:46:12] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2373745 (Paladox)
[11:47:14] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2361418 (Paladox)
[11:52:11] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2373749 (Paladox)
[11:52:24] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2373750 (Danny_B) @mmodell Can't get the way how to translate https://phabricator.wikimedia.org/r/p/mediawiki/core...
[11:56:34] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2373752 (Paladox)
[12:10:45] PROBLEM - puppet last run on cp4001 is CRITICAL: CRITICAL: puppet fail
[12:16:04] PROBLEM - puppet last run on mw2151 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:18:24] PROBLEM - puppet last run on mw2161 is CRITICAL: CRITICAL: Puppet has 1 failures
[12:20:34] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 3 failures
[12:37:50] RECOVERY - puppet last run on cp4001 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures
[12:42:20] RECOVERY - puppet last run on mw2151 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:44:40] RECOVERY - puppet last run on mw2161 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[12:46:50] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 49 seconds ago with 0 failures
[13:05:05] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Proxy Error - 557 bytes in 0.001 second response time
[13:07:14] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 4.330 second response time
[13:18:35] PROBLEM - HP RAID on ms-be1016 is CRITICAL: CHECK_NRPE: Socket timeout after 20 seconds.
[13:20:35] RECOVERY - HP RAID on ms-be1016 is OK: OK: Slot 1: OK: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2, Controller, Battery/Capacitor
[13:21:24] PROBLEM - puppet last run on labvirt1010 is CRITICAL: CRITICAL: Puppet has 1 failures
[13:44:14] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Proxy Error - 557 bytes in 0.003 second response time
[13:48:14] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.010 second response time
[14:17:44] RECOVERY - puppet last run on labvirt1010 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[14:18:34] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:22:25] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 0.781 second response time
[14:53:50] PROBLEM - graphite.wmflabs.org on labmon1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[14:55:50] RECOVERY - graphite.wmflabs.org on labmon1001 is OK: HTTP OK: HTTP/1.1 200 OK - 1701 bytes in 9.466 second response time
[15:17:29] (PS1) Ladsgroup: ores: fix workers and config [puppet] - https://gerrit.wikimedia.org/r/293904
[15:18:49] (CR) jenkins-bot: [V: -1] ores: fix workers and config [puppet] - https://gerrit.wikimedia.org/r/293904 (owner: Ladsgroup)
[15:21:53] (PS2) Ladsgroup: ores: fix workers and config [puppet] - https://gerrit.wikimedia.org/r/293904
[15:33:45] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2373920 (Danny_B)
[15:39:42] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2373925 (Danny_B)
[16:00:53] (CR) Glaisher: [C: -1] "There is no form field present for submitting the actual message." (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/225509 (https://phabricator.wikimedia.org/T98625) (owner: Alex Monk)
[16:03:50] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2373943 (Danny_B)
[16:17:24] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2373945 (Danny_B)
[16:37:10] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2373947 (Danny_B)
[16:42:31] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2373954 (Paladox)
[17:22:55] (CR) JanZerebecki: "Will this break when something else on the same box already does package { 'pigz'?" [puppet] - https://gerrit.wikimedia.org/r/293743 (owner: Jcrespo)
[17:42:11] Operations, DBA, Labs, Tracking: Database replication services (tracking) - https://phabricator.wikimedia.org/T50930#2373988 (valhallasw)
[18:01:05] Operations, Labs, Labs-Infrastructure, Release-Engineering-Team, and 3 others: Firewall rules for labs support host to communicate with contint1001.eqiad.wmnet (new gallium) - https://phabricator.wikimedia.org/T137323#2374010 (JanZerebecki) While T137323#2365101 is important. I don't understand T...
[18:40:16] PROBLEM - puppet last run on lvs2001 is CRITICAL: CRITICAL: puppet fail
[18:40:37] PROBLEM - Redis status tcp_6479 on rdb2006 is CRITICAL: CRITICAL: replication_delay is 654 600 - REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5624076 keys - replication_delay is 654
[18:44:36] RECOVERY - Redis status tcp_6479 on rdb2006 is OK: OK: REDIS on 10.192.48.44:6479 has 1 databases (db0) with 5562450 keys - replication_delay is 0
[18:58:02] Operations, Discovery, Maps: Ensure that Kartotherian on new maps200? servers are sending metrics to Graphite - https://phabricator.wikimedia.org/T137619#2374049 (Gehel) Checked via tcpdump, metrics are sent. Small extract: ``` 18:56:31.566505 IP maps2001.codfw.wmnet.38665 > graphite1001.eqiad.wmnet...
[18:58:25] Operations, Discovery, Maps, Epic: Epic: switch Maps to production status - https://phabricator.wikimedia.org/T133744#2374052 (Gehel)
[18:58:27] Operations, Discovery, Maps, Discovery-Maps-Sprint: Ensure that Kartotherian on new maps200? servers are sending metrics to Graphite - https://phabricator.wikimedia.org/T137619#2374050 (Gehel) Open>Resolved
[18:59:44] (PS1) Dereckson: Set import sources for pt.wikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/293912 (https://phabricator.wikimedia.org/T137633)
[19:05:27] Operations, Discovery, Maps: Send logs to logstash for maps services (katotherian, tilerator, tileratorui) - https://phabricator.wikimedia.org/T137618#2374060 (Gehel) The 3 services (kartotherian, tilerator, tileratorui) are all configured to send logs to logstash via Gelf (see /etc/(kartotherian|til...
[19:07:24] RECOVERY - puppet last run on lvs2001 is OK: OK: Puppet is currently enabled, last run 29 seconds ago with 0 failures
[19:12:18] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2374062 (mmodell) >>! In T137224#2373750, @Danny_B wrote: > @mmodell Can't get the way how to translate > > https:...
[19:40:48] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2374082 (Danny_B)
[19:42:39] Operations, Ops-Access-Requests, Services, Patch-For-Review: sc-admins should be able to join firejail containers - https://phabricator.wikimedia.org/T137412#2374084 (JanZerebecki) @mobrovac Can you review if the patch would solve your request?
[19:54:51] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2374087 (Danny_B)
[19:55:13] (PS1) JanZerebecki: Add a notification parameter of analytics to cassandra monitoring [puppet] - https://gerrit.wikimedia.org/r/293916 (https://phabricator.wikimedia.org/T137422)
[20:00:13] Operations, Labs, Labs-Infrastructure, Release-Engineering-Team, and 3 others: Firewall rules for labs support host to communicate with contint1001.eqiad.wmnet (new gallium) - https://phabricator.wikimedia.org/T137323#2364905 (BBlack) @JanZerebecki - most of our production hosts are in private in...
[20:15:48] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2374108 (Danny_B)
[20:38:30] (CR) Muehlenhoff: "There shouldn't be any collisions; Debian ensures a unique namespace for anything in /usr/bin and local installations should go to /usr/lo" [puppet] - https://gerrit.wikimedia.org/r/293743 (owner: Jcrespo)
[21:11:50] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2374123 (Danny_B)
[21:13:24] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2374124 (Paladox)
[21:28:18] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2374125 (Danny_B) Where should /pages/... go? See https://git.wikimedia.org/summary/mediawiki%2Fextensions%2FCards...
[21:29:15] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2374126 (Paladox)
[21:30:19] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2361418 (Paladox) >>! In T137224#2374125, @Danny_B wrote: > Where should /pages/... go? > > See https://git.wikime...
[21:36:37] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2374130 (Paladox)
[22:31:35] PROBLEM - Disk space on lithium is CRITICAL: DISK CRITICAL - free space: /srv/syslog 11452 MB (3% inode=99%)
[22:39:47] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2374159 (Paladox)
[23:06:25] Operations, ops-codfw: Replace install2001's Ethernet cable - https://phabricator.wikimedia.org/T137647#2374163 (faidon)
[23:09:21] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2374176 (Paladox)
[23:13:56] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2374177 (Paladox)
[23:23:26] Operations, Release-Engineering-Team, Gitblit-Deprecate, Patch-For-Review: write Apache rewrite rules for gitblit -> diffusion migration - https://phabricator.wikimedia.org/T137224#2374179 (Paladox)
[23:51:22] PROBLEM - puppet last run on mw2123 is CRITICAL: CRITICAL: Puppet has 1 failures