[00:03:56] PROBLEM - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:04:46] RECOVERY - LVS HTTP IPv4 on thumbor.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 241 bytes in 0.002 second response time [00:15:04] mh another case of T151851 I think, checking now [00:15:05] T151851: Thumbor resource consumption is spiky - https://phabricator.wikimedia.org/T151851 [00:24:39] silenced until monday, we might as well remove load from it since it isin't in production [00:25:08] 06Operations, 06Commons, 06Multimedia, 10media-storage, 15User-Urbanecm: Metro Mad Linea 7.png file half-disappeared - it can't be used - https://phabricator.wikimedia.org/T153540#2883589 (10Tgr) Yay for mirrors! http://ftpmirror.your.org/pub/wikimedia/images/wikipedia/commons/c/cf/Metro_Mad_Linea_7.png... [00:45:03] PROBLEM - puppet last run on analytics1057 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:52:43] 06Operations, 06Commons, 06Multimedia, 10media-storage, 15User-Urbanecm: Metro Mad Linea 7.png file half-disappeared - it can't be used - https://phabricator.wikimedia.org/T153540#2884363 (10Tgr) Apparently if the file had been uploaded a year later, we would be out of luck: {T53001} Filed {T153565} abo... [00:55:07] 06Operations, 10Datasets-Archiving, 10Datasets-General-or-Unknown: Image tarball dumps on your.org are not being generated - https://phabricator.wikimedia.org/T53001#571615 (10Tgr) Is this specifically about the tarballs or is http://ftpmirror.your.org/pub/wikimedia/images/ similarly affected? Given our tend... [01:15:03] RECOVERY - puppet last run on analytics1057 is OK: OK: Puppet is currently enabled, last run 45 seconds ago with 0 failures [01:23:07] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2884411 (10Smalyshev) We previously discussed this and the tradition for entity identifiers is to use http. E.g. such commonly known prefi... [01:25:03] PROBLEM - puppet last run on db1053 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [01:30:13] PROBLEM - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [01:30:23] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [01:30:23] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [01:30:33] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [01:30:33] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [01:31:03] PROBLEM - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [01:31:23] PROBLEM - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.119 and port 9042: Connection refused [01:31:23] PROBLEM - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.207 and port 9042: Connection refused [01:31:43] PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [01:31:53] PROBLEM - cassandra-c service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [01:31:53] PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [01:31:54] PROBLEM - cassandra-c service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [01:32:00] mutante: ^ [01:32:33] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [01:53:03] RECOVERY - puppet last run on db1053 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [02:18:51] !log l10nupdate@tin scap sync-l10n completed (1.29.0-wmf.6) (duration: 06m 39s) [02:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:23:12] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Dec 18 02:23:11 UTC 2016 (duration 4m 20s) [02:23:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:24:23] PROBLEM - puppet last run on maerlant is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [02:30:47] I see l10n is back to normal [02:36:18] yeah [02:36:40] the sync-l10n scap package update was deployed to the servers [02:41:11] What was wrong with it? 
[02:52:23] RECOVERY - puppet last run on maerlant is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [02:53:59] there was a thing [02:54:06] it broke stuff [02:54:27] so it broke l10nupdate-1 [02:55:41] Zppix, https://phabricator.wikimedia.org/T152390 [03:16:07] There was a thing it broke stuff so it broke... thats the best thing ive heard on irc Krenair [03:23:53] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 681.06 seconds [03:24:03] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:29:53] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 161.21 seconds [03:33:03] PROBLEM - puppet last run on mw1286 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/share/GeoIP/GeoIP2-City.mmdb.gz] [03:35:45] Is Mathoid having issues? [03:35:50] Getting... [03:36:59] getting...? [03:40:05] Yvette? [03:47:48] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2884514 (10MZMcBride) >>! In T153563#2884409, @Smalyshev wrote: > We previously discussed this and the tradition for entity identifiers is... [03:48:02] looks like Yvette's thing is in #mediawiki [03:49:10] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2884515 (10MZMcBride) In this specific context, the query service is outputting URLs (yes, URLs, right here in River City) such as this on... [03:49:11] > Failed to parse (MathML with SVG or PNG fallback (recommended for modern browsers and accessibility tools): Invalid response ("Math extension cannot connect to Restbase.") from server "/mathoid/local/v1/":): {\displaystyle \left(\frac {dG}{d\xi}\right)_{T,p} = RT \ln \left(\frac {Q_\mathrm{r}}{K_\mathrm{eq}}\right)~} [03:49:22] Sorry, got distracted with that other task. [03:50:32] From https://en.wikipedia.org/w/index.php?title=Chemical_equilibrium&action=history [03:51:27] https://en.wikipedia.org/wiki/Chemical_equilibrium#Addition_of_reactants_or_products [03:52:03] PROBLEM - puppet last run on cp1065 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:52:04] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 15 seconds ago with 0 failures [04:01:03] RECOVERY - puppet last run on mw1286 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [04:04:53] RECOVERY - Check systemd state on restbase1013 is OK: OK - running: The system is fully operational [04:05:03] RECOVERY - cassandra-c service on restbase1013 is OK: OK - cassandra-c is active [04:07:53] PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
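[editor's note: for readability, the expression from the Mathoid render failure quoted at 03:49 is repeated below as a plain LaTeX block; it is copied from the error text (the reaction free-energy change written in terms of the reaction quotient Q_r and the equilibrium constant K_eq), not re-derived.]

    \left(\frac{dG}{d\xi}\right)_{T,p} = RT \ln\left(\frac{Q_\mathrm{r}}{K_\mathrm{eq}}\right)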
[04:08:03] PROBLEM - cassandra-c service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [04:20:03] RECOVERY - puppet last run on cp1065 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [04:39:53] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=746.30 Read Requests/Sec=248.10 Write Requests/Sec=0.90 KBytes Read/Sec=31713.60 KBytes_Written/Sec=20.40 [04:50:03] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] [04:50:53] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=4.10 Read Requests/Sec=0.10 Write Requests/Sec=0.30 KBytes Read/Sec=0.40 KBytes_Written/Sec=10.40 [04:55:03] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] [04:56:54] 06Operations, 06Discovery, 10Traffic, 10Wikidata, and 2 others: Consider switching to HTTPS for Wikidata query service links - https://phabricator.wikimedia.org/T153563#2884289 (10Esc3300) For the reasons WMF uses https shouldn't be make sure that users don't access http ? If people are given sufficient... [04:58:48] 06Operations, 10ArchCom-RfC, 06Commons, 10MediaWiki-File-management, and 14 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214#2884546 (10Tgr) After thinking more about this and looking at the code I am getting more and more confused about what exactly we are trying to do. Medi... [05:34:53] RECOVERY - Check systemd state on restbase1013 is OK: OK - running: The system is fully operational [05:35:03] RECOVERY - cassandra-c service on restbase1013 is OK: OK - cassandra-c is active [05:37:53] PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [05:38:03] PROBLEM - cassandra-c service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [06:02:09] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 90.00% of data above the critical threshold [1000.0] [06:06:27] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] [06:19:03] https://meta.wikimedia.org/wiki/Special:GlobalUserRights --> [WFYqGgpAAEQAAjAVLSkAAAAW] 2016-12-18 06:18:02: Fatal exception of type MWException [06:19:07] Logged in. [06:25:56] PROBLEM - MariaDB disk space on db1047 is CRITICAL: DISK CRITICAL - free space: / 398 MB (5% inode=54%) [06:26:03] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [1000.0] [06:26:56] RECOVERY - MariaDB disk space on db1047 is OK: DISK OK [06:28:03] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] [06:29:07] got paged for db1047 space, can takr a look on 20 min or so [06:42:33] RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational [06:42:33] PROBLEM - puppet last run on elastic2009 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[tmux],Package[pv] [06:45:33] PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
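[editor's note: for the db1047 disk-space page at 06:25/06:29, first-pass triage usually looks something like the read-only sketch below; the directories are generic guesses rather than db1047's actual layout, and in this case the alert cleared on its own about a minute later.]

    # on the alerting host (db1047 here); read-only checks only
    df -h /                                                            # confirm which filesystem is nearly full
    sudo du -xsh /var/log /srv/* 2>/dev/null | sort -h | tail -n 10    # biggest consumers on that filesystem
    ls -lSh /var/log | head                                            # unrotated or oversized logs are a common culprit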
[06:47:03] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] [06:48:42] recovered by itself shortly afterwards heh [06:56:28] (03PS2) 10Tim Landscheidt: Tools: Make tools-clush-generator project-agnostic [puppet] - 10https://gerrit.wikimedia.org/r/326892 [06:56:30] (03PS1) 10Tim Landscheidt: Tools: Generate node sets dynamically [puppet] - 10https://gerrit.wikimedia.org/r/328030 [06:59:47] (03CR) 10Tim Landscheidt: [] "Tested on Toolsbeta." [puppet] - 10https://gerrit.wikimedia.org/r/328030 (owner: 10Tim Landscheidt) [07:02:54] 07Puppet, 06Labs, 10Tool-Labs: Make standalone puppetmasters optionally use PuppetDB - https://phabricator.wikimedia.org/T153577#2884609 (10scfc) [07:04:49] I filed the exception as https://phabricator.wikimedia.org/T153578 [07:05:53] RECOVERY - Check systemd state on restbase1013 is OK: OK - running: The system is fully operational [07:06:03] RECOVERY - cassandra-c service on restbase1013 is OK: OK - cassandra-c is active [07:07:09] !log force git-fat pull for twcs on restbase1* to restore twcs jar [07:07:13] RECOVERY - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is OK: SSL OK - Certificate restbase1013-c valid until 2017-09-12 15:34:23 +0000 (expires in 268 days) [07:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:07:24] RECOVERY - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is OK: TCP OK - 0.002 second response time on 10.64.32.207 port 9042 [07:07:33] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [07:07:33] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [07:07:33] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [07:07:33] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [07:07:34] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [07:10:33] RECOVERY - puppet last run on elastic2009 is OK: OK: Puppet is currently enabled, last run 18 seconds ago with 0 failures [07:12:04] RECOVERY - cassandra-c service on restbase1011 is OK: OK - cassandra-c is active [07:12:33] RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational [07:12:34] 06Operations, 06Commons, 06Multimedia, 10media-storage, 15User-Urbanecm: Metro Mad Linea 7.png file half-disappeared - it can't be used - https://phabricator.wikimedia.org/T153540#2884634 (10zhuyifei1999) >>! In T153540#2884321, @Tgr wrote: > Yay for mirrors! http://ftpmirror.your.org/pub/wikimedia/image... 
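[editor's note: a hedged sketch of the 07:07 "force git-fat pull" entry above. The checkout path is an assumption, and in practice the command would be fanned out to every restbase1* host (e.g. with cumin or salt) rather than typed by hand on each one.]

    cd /srv/deployment/restbase/deploy   # assumed location of the RESTBase deploy checkout
    git fat pull                         # re-download the git-fat-managed binaries, restoring the missing twcs jar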
[07:13:13] RECOVERY - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-c valid until 2017-09-12 15:34:08 +0000 (expires in 268 days) [07:13:24] RECOVERY - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is OK: TCP OK - 0.004 second response time on 10.64.0.119 port 9042 [07:52:42] (03PS1) 10Tim Landscheidt: WIP [puppet] - 10https://gerrit.wikimedia.org/r/328031 (https://phabricator.wikimedia.org/T150726) [08:26:23] PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [08:26:23] PROBLEM - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [08:26:24] PROBLEM - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.207 and port 9042: Connection refused [08:27:03] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [08:27:03] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:27:13] PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused [08:28:03] PROBLEM - cassandra-c service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [08:28:53] PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:30:03] PROBLEM - puppet last run on ms-be1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [08:34:03] PROBLEM - puppet last run on sca1004 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [08:34:53] RECOVERY - Check systemd state on restbase1013 is OK: OK - running: The system is fully operational [08:35:03] RECOVERY - cassandra-c service on restbase1013 is OK: OK - cassandra-c is active [08:36:23] RECOVERY - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is OK: SSL OK - Certificate restbase1013-c valid until 2017-09-12 15:34:23 +0000 (expires in 268 days) [08:36:24] RECOVERY - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is OK: TCP OK - 0.003 second response time on 10.64.32.207 port 9042 [08:39:33] PROBLEM - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [08:39:33] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [08:39:34] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [08:39:34] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [08:39:34] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [08:39:43] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [08:40:13] PROBLEM - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [08:40:23] PROBLEM - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.119 and port 9042: Connection refused [08:40:24] PROBLEM - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.207 and port 9042: Connection refused [08:40:33] PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:41:03] PROBLEM - cassandra-c service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [08:41:53] PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[08:42:03] PROBLEM - cassandra-c service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [08:42:04] RECOVERY - cassandra-c service on restbase1011 is OK: OK - cassandra-c is active [08:42:33] RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational [08:43:03] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active [08:43:03] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational [08:43:06] !log forced puppet on restbase1009 to bring up cassandra-a (stopped due to OOM issues) [08:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:43:23] RECOVERY - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-c valid until 2017-09-12 15:34:08 +0000 (expires in 268 days) [08:43:23] RECOVERY - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is OK: TCP OK - 0.003 second response time on 10.64.0.119 port 9042 [08:43:33] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [08:43:33] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [08:43:33] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [08:43:33] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [08:43:33] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [08:44:13] RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on 10.64.48.120 port 9042 [08:44:33] RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2017-09-12 15:33:48 +0000 (expires in 268 days) [08:46:33] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [08:46:33] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [08:46:33] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [08:46:33] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [08:46:33] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [08:47:03] PROBLEM - cassandra-c service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [08:47:13] PROBLEM - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [08:47:14] PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect 
to address 10.64.48.120 and port 9042: Connection refused [08:47:23] PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [08:47:24] PROBLEM - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.119 and port 9042: Connection refused [08:47:33] PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [08:48:33] RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2017-09-12 15:33:48 +0000 (expires in 268 days) [08:49:13] RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on 10.64.48.120 port 9042 [08:49:38] !log forced restart for cassandra-a on restbase1009 (still OOMs) [08:49:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:50] !log forced restart of cassandra-b/c on restbase1013 (b not really needed, my error) [08:51:53] RECOVERY - Check systemd state on restbase1013 is OK: OK - running: The system is fully operational [08:52:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:03] RECOVERY - cassandra-c service on restbase1013 is OK: OK - cassandra-c is active [08:52:24] RECOVERY - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is OK: TCP OK - 0.002 second response time on 10.64.32.207 port 9042 [08:52:33] RECOVERY - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is OK: SSL OK - Certificate restbase1013-c valid until 2017-09-12 15:34:23 +0000 (expires in 268 days) [08:52:43] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [08:52:43] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [08:52:43] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [08:52:43] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [08:53:33] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [08:57:03] RECOVERY - cassandra-c service on restbase1011 is OK: OK - cassandra-c is active [08:57:20] !log forced restart of cassandra-c on restbase1011 [08:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:33] RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational [08:58:03] RECOVERY - puppet last run on ms-be1014 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [08:58:13] RECOVERY - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-c valid until 2017-09-12 15:34:08 +0000 (expires in 268 days) [08:58:23] RECOVERY - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is OK: TCP OK - 0.005 second response time on 10.64.0.119 port 9042 [09:00:03] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
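[editor's note: the flapping above follows a repeating pattern — a Cassandra instance dies from an OOM, its systemd unit goes to "failed", and it is brought back either by a forced puppet run (08:43) or a manual restart (08:49, 08:51). A generic shell sketch of that triage/restart cycle follows; the host and unit names are taken from the log, everything else is illustrative rather than the exact commands used.]

    # on e.g. restbase1009, for the failed cassandra-a instance
    systemctl status cassandra-a                                                          # confirm the unit is in the "failed" state
    journalctl -u cassandra-a --since '1 hour ago' | grep -i -e OutOfMemory -e killed     # find the OOM that stopped it
    systemctl restart cassandra-a                                                         # manual restart, as in the !log entries
    puppet agent --test                                                                   # alternatively, a puppet run ensures the service is started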
[09:00:14] PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused [09:00:23] PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [09:00:24] PROBLEM - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.207 and port 9042: Connection refused [09:00:33] PROBLEM - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [09:00:53] PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [09:01:03] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [09:01:03] PROBLEM - cassandra-c service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [09:03:03] RECOVERY - puppet last run on sca1004 is OK: OK: Puppet is currently enabled, last run 55 seconds ago with 0 failures [09:05:53] RECOVERY - Check systemd state on restbase1013 is OK: OK - running: The system is fully operational [09:06:03] RECOVERY - cassandra-c service on restbase1013 is OK: OK - cassandra-c is active [09:06:43] RECOVERY - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is OK: SSL OK - Certificate restbase1013-c valid until 2017-09-12 15:34:23 +0000 (expires in 268 days) [09:07:24] RECOVERY - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is OK: TCP OK - 0.000 second response time on 10.64.32.207 port 9042 [09:15:03] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active [09:15:03] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational [09:15:33] RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2017-09-12 15:33:48 +0000 (expires in 268 days) [09:16:13] RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on 10.64.48.120 port 9042 [09:24:39] urandom: ---^ [09:24:56] a big cassandra instaces flapping event [09:46:51] * elukey afk! [10:05:33] PROBLEM - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.207 and port 9042: Connection refused [10:05:43] PROBLEM - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [10:06:13] PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused [10:06:23] PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [10:06:54] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2882187 (10Urbanecm) [10:07:03] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [10:07:13] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:07:34] RECOVERY - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is OK: TCP OK - 0.000 second response time on 10.64.32.207 port 9042 [10:07:43] RECOVERY - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is OK: SSL OK - Certificate restbase1013-c valid until 2017-09-12 15:34:23 +0000 (expires in 268 days) [10:07:46] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2882187 (10Urbanecm) [10:08:01] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2882187 (10Urbanecm) [10:12:33] PROBLEM - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.207 and port 9042: Connection refused [10:12:53] PROBLEM - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [10:14:03] PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:14:23] PROBLEM - cassandra-c service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [10:17:13] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational [10:18:03] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active [10:20:33] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [10:20:34] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [10:20:34] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [10:20:34] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [10:20:43] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [10:21:13] PROBLEM - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [10:21:23] PROBLEM - cassandra-c service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [10:21:33] PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[10:21:33] PROBLEM - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.119 and port 9042: Connection refused [10:22:04] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [10:22:13] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:35:23] RECOVERY - cassandra-c service on restbase1013 is OK: OK - cassandra-c is active [10:36:04] RECOVERY - Check systemd state on restbase1013 is OK: OK - running: The system is fully operational [10:36:33] RECOVERY - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is OK: TCP OK - 0.000 second response time on 10.64.32.207 port 9042 [10:36:33] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [10:36:43] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [10:36:43] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [10:36:43] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [10:36:43] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [10:36:53] RECOVERY - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is OK: SSL OK - Certificate restbase1013-c valid until 2017-09-12 15:34:23 +0000 (expires in 268 days) [10:42:23] RECOVERY - cassandra-c service on restbase1011 is OK: OK - cassandra-c is active [10:42:33] RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational [10:43:14] RECOVERY - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-c valid until 2017-09-12 15:34:08 +0000 (expires in 268 days) [10:43:33] RECOVERY - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is OK: TCP OK - 0.000 second response time on 10.64.0.119 port 9042 [10:48:03] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active [10:48:13] RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on 10.64.48.120 port 9042 [10:48:14] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational [10:48:23] RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2017-09-12 15:33:48 +0000 (expires in 268 days) [10:54:44] (03PS1) 10Urbanecm: Add ftpmirror.your.org to whitelist of commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/328036 (https://phabricator.wikimedia.org/T153569) [11:00:33] PROBLEM - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.207 and port 9042: Connection refused [11:00:53] PROBLEM - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:01:13] PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused [11:01:23] PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:01:23] 06Operations, 06Commons, 06Multimedia, 10media-storage, 15User-Urbanecm: Metro Mad Linea 7.png file half-disappeared - it can't be used - https://phabricator.wikimedia.org/T153540#2884869 (10Urbanecm) 05Open>03Resolved a:03zhuyifei1999 Thanks for resolving, seems it works, marking as resolved. 
[11:02:03] PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:02:04] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [11:02:13] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:02:23] PROBLEM - cassandra-c service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [11:03:14] (03PS1) 10Zhuyifei1999: videoscaler: Reduce runners_transcode from 5 to 2 [puppet] - 10https://gerrit.wikimedia.org/r/328037 (https://phabricator.wikimedia.org/T153488) [11:05:03] RECOVERY - Check systemd state on restbase1013 is OK: OK - running: The system is fully operational [11:05:07] 06Operations, 06Commons, 10TimedMediaHandler-Transcode, 10Wikimedia-Video, 13Patch-For-Review: Commons video transcoders have over 6500 tasks in the backlog. - https://phabricator.wikimedia.org/T153488#2884876 (10zhuyifei1999) 500% load. Seems that each runner cause 100% load. [11:05:23] RECOVERY - cassandra-c service on restbase1013 is OK: OK - cassandra-c is active [11:06:33] RECOVERY - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is OK: TCP OK - 0.001 second response time on 10.64.32.207 port 9042 [11:06:53] RECOVERY - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is OK: SSL OK - Certificate restbase1013-c valid until 2017-09-12 15:34:23 +0000 (expires in 268 days) [11:17:04] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active [11:17:13] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational [11:18:13] RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on 10.64.48.120 port 9042 [11:18:23] RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2017-09-12 15:33:48 +0000 (expires in 268 days) [11:22:03] PROBLEM - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:22:04] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [11:22:13] PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused [11:22:14] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:22:23] PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:22:33] PROBLEM - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.207 and port 9042: Connection refused [11:23:03] PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:23:23] PROBLEM - cassandra-c service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [11:33:33] PROBLEM - puppet last run on cp4006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:36:03] RECOVERY - Check systemd state on restbase1013 is OK: OK - running: The system is fully operational [11:36:23] RECOVERY - cassandra-c service on restbase1013 is OK: OK - cassandra-c is active [11:36:33] RECOVERY - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is OK: TCP OK - 0.000 second response time on 10.64.32.207 port 9042 [11:37:03] RECOVERY - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is OK: SSL OK - Certificate restbase1013-c valid until 2017-09-12 15:34:23 +0000 (expires in 268 days) [11:44:33] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [11:44:34] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [11:44:34] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [11:44:34] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [11:44:34] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [11:45:13] PROBLEM - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:45:14] PROBLEM - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [11:45:23] PROBLEM - puppet last run on mw1254 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:45:33] PROBLEM - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.119 and port 9042: Connection refused [11:45:33] PROBLEM - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.207 and port 9042: Connection refused [11:46:23] PROBLEM - cassandra-c service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [11:46:33] PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [11:47:03] PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[11:47:04] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active [11:47:13] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational [11:47:23] PROBLEM - cassandra-c service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [11:48:13] RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on 10.64.48.120 port 9042 [11:48:23] RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2017-09-12 15:33:48 +0000 (expires in 268 days) [12:02:33] RECOVERY - puppet last run on cp4006 is OK: OK: Puppet is currently enabled, last run 9 seconds ago with 0 failures [12:05:23] RECOVERY - cassandra-c service on restbase1013 is OK: OK - cassandra-c is active [12:06:03] RECOVERY - Check systemd state on restbase1013 is OK: OK - running: The system is fully operational [12:06:23] RECOVERY - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is OK: SSL OK - Certificate restbase1013-c valid until 2017-09-12 15:34:23 +0000 (expires in 268 days) [12:06:33] RECOVERY - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is OK: TCP OK - 0.000 second response time on 10.64.32.207 port 9042 [12:06:43] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [12:06:43] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [12:06:43] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [12:06:43] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [12:06:44] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [12:08:22] !log disabling puppet on restbase1009, restbase1011 and restbase1013 due to cassandra OOMs [12:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:23] PROBLEM - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:09:23] PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [12:09:33] PROBLEM - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is CRITICAL: connect to address 10.64.32.207 and port 9042: Connection refused [12:09:34] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [12:09:34] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [12:09:34] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [12:09:43] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [12:09:43] PROBLEM - mobileapps endpoints health 
on scb1003 is CRITICAL: /{domain}/v1/page/definition/{title} (retrieve en-wiktionary definitions for cat) is CRITICAL: Test retrieve en-wiktionary definitions for cat returned the unexpected status 500 (expecting: 200) [12:10:04] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [12:10:14] PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused [12:10:14] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:11:03] PROBLEM - Check systemd state on restbase1013 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [12:11:23] PROBLEM - cassandra-c service on restbase1013 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [12:13:23] RECOVERY - puppet last run on mw1254 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [12:16:23] RECOVERY - cassandra-c service on restbase1013 is OK: OK - cassandra-c is active [12:17:03] RECOVERY - Check systemd state on restbase1013 is OK: OK - running: The system is fully operational [12:17:23] RECOVERY - cassandra-c SSL 10.64.32.207:7001 on restbase1013 is OK: SSL OK - Certificate restbase1013-c valid until 2017-09-12 15:34:23 +0000 (expires in 268 days) [12:17:33] RECOVERY - cassandra-c CQL 10.64.32.207:9042 on restbase1013 is OK: TCP OK - 0.000 second response time on 10.64.32.207 port 9042 [12:17:38] !log started back cassandra restbase1013-c [12:17:43] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [12:17:43] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [12:17:43] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [12:17:43] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [12:17:43] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [12:17:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:27:23] RECOVERY - cassandra-c service on restbase1011 is OK: OK - cassandra-c is active [12:27:29] !log started back cassandra restbase1011-c [12:27:33] RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational [12:27:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:29:13] RECOVERY - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-c valid until 2017-09-12 15:34:08 +0000 (expires in 268 days) [12:29:34] RECOVERY - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is OK: TCP OK - 0.001 second response time on 10.64.0.119 port 9042 [12:37:03] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active [12:37:13] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational [12:37:33] RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2017-09-12 15:33:48 +0000 (expires in 268 days) [12:38:12] !log started back cassandra restbase1009-a [12:38:13] RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on 10.64.48.120 port 9042 [12:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:13] PROBLEM - puppet last run on mc1017 is 
CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:26:13] RECOVERY - puppet last run on mc1017 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [13:34:44] \o [13:35:13] not sure if this is where to ask, but I've been getting tons of spam irc messages [13:35:48] and I don't know how to stop it [13:38:24] so have most people [13:38:30] #freenode perhaps [13:38:50] nothing we can do? :( [13:38:55] no [13:39:03] you can set umode +R [13:39:39] ok, thanks :( [14:16:33] PROBLEM - PyBal backends health check on lvs1003 is CRITICAL: PYBAL CRITICAL - thumbor_8800 - Could not depool server thumbor1002.eqiad.wmnet because of too many down! [14:20:33] RECOVERY - PyBal backends health check on lvs1003 is OK: PYBAL OK - All pools are healthy [14:37:14] elukey: around ? [15:37:13] PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused [15:37:23] PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:37:23] PROBLEM - cassandra-c service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [15:37:23] PROBLEM - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [15:37:33] PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:37:33] PROBLEM - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.119 and port 9042: Connection refused [15:38:13] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [15:38:24] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [15:55:53] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [15:56:43] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [16:36:23] PROBLEM - puppet last run on mw1164 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:45:14] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational [16:45:23] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active [16:45:41] !log starting cassandra instances on restbase1009, restbase1011 and restbase1013 (one at the time) - T153588 [16:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:54] T153588: Cassandra OOMs on restbase1009-a, restbase1011-c and restbase1013-c - https://phabricator.wikimedia.org/T153588 [16:46:13] RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on 10.64.48.120 port 9042 [16:46:23] RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2017-09-12 15:33:48 +0000 (expires in 267 days) [16:46:38] urandom: you there? 
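[editor's note: the 13:39 advice above refers to the freenode user mode that rejects private messages from users not identified with NickServ; from most IRC clients it is set with the line below, substituting your own nick.]

    /mode YourNick +R    # drop PMs from unregistered users (the spam source in question)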
[16:49:23] RECOVERY - cassandra-c service on restbase1011 is OK: OK - cassandra-c is active [16:49:28] ok 1009 is up, proceeding with 1011 [16:49:33] RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational [16:50:33] RECOVERY - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is OK: TCP OK - 0.000 second response time on 10.64.0.119 port 9042 [16:50:43] RECOVERY - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-c valid until 2017-09-12 15:34:08 +0000 (expires in 267 days) [16:52:43] PROBLEM - puppet last run on restbase1009 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [16:54:33] PROBLEM - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:54:34] PROBLEM - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.119 and port 9042: Connection refused [16:55:12] nope, 1011 doesn't want to come up [16:55:13] PROBLEM - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is CRITICAL: connect to address 10.64.48.120 and port 9042: Connection refused [16:55:14] PROBLEM - Check systemd state on restbase1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [16:55:23] PROBLEM - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [16:55:23] PROBLEM - cassandra-c service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [16:55:24] PROBLEM - cassandra-a service on restbase1009 is CRITICAL: CRITICAL - Expecting active but unit cassandra-a is failed [16:55:42] and now again 1009 [16:55:43] PROBLEM - puppet last run on analytics1053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:55:44] sigh [16:55:48] mobrovac: you there? [16:56:23] RECOVERY - cassandra-c service on restbase1011 is OK: OK - cassandra-c is active [16:56:33] RECOVERY - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-c valid until 2017-09-12 15:34:08 +0000 (expires in 267 days) [16:56:34] RECOVERY - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is OK: TCP OK - 0.000 second response time on 10.64.0.119 port 9042 [16:57:19] ah saw your email [16:57:35] ops people: Marko is commuting and will take care of the instances in ~25 mins [17:00:33] PROBLEM - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [17:01:23] PROBLEM - cassandra-c service on restbase1011 is CRITICAL: CRITICAL - Expecting active but unit cassandra-c is failed [17:01:33] PROBLEM - Check systemd state on restbase1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [17:01:34] PROBLEM - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is CRITICAL: connect to address 10.64.0.119 and port 9042: Connection refused [17:04:23] RECOVERY - puppet last run on mw1164 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures [17:19:54] (03CR) 10Dereckson: [] "The current '5' value has been introduced by commit 95b52b9b48." 
[puppet] - 10https://gerrit.wikimedia.org/r/328037 (https://phabricator.wikimedia.org/T153488) (owner: 10Zhuyifei1999) [17:23:43] RECOVERY - puppet last run on analytics1053 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures [17:27:33] RECOVERY - puppet last run on restbase1009 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [17:28:24] RECOVERY - Check systemd state on restbase1009 is OK: OK - running: The system is fully operational [17:28:24] RECOVERY - cassandra-a service on restbase1009 is OK: OK - cassandra-a is active [17:29:13] RECOVERY - cassandra-a CQL 10.64.48.120:9042 on restbase1009 is OK: TCP OK - 0.000 second response time on 10.64.48.120 port 9042 [17:29:23] RECOVERY - cassandra-a SSL 10.64.48.120:7001 on restbase1009 is OK: SSL OK - Certificate restbase1009-a valid until 2017-09-12 15:33:48 +0000 (expires in 267 days) [17:30:23] RECOVERY - cassandra-c service on restbase1011 is OK: OK - cassandra-c is active [17:30:33] RECOVERY - Check systemd state on restbase1011 is OK: OK - running: The system is fully operational [17:30:43] RECOVERY - cassandra-c SSL 10.64.0.119:7001 on restbase1011 is OK: SSL OK - Certificate restbase1011-c valid until 2017-09-12 15:34:08 +0000 (expires in 267 days) [17:31:33] RECOVERY - cassandra-c CQL 10.64.0.119:9042 on restbase1011 is OK: TCP OK - 0.001 second response time on 10.64.0.119 port 9042 [18:17:53] PROBLEM - restbase endpoints health on restbase-test2002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:17:53] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:17:53] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:17:53] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:17:53] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [18:18:43] RECOVERY - restbase endpoints health on restbase-test2002 is OK: All endpoints are healthy [18:19:43] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [18:19:44] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [18:19:44] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy [18:19:44] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy [18:23:45] ^ ? 
[18:28:18] (Draft1) Paladox: Contint: Make sure /mnt/home/jenkins-deploy/tmpfs is mounted before starting MySQL [puppet] - https://gerrit.wikimedia.org/r/328051 (https://phabricator.wikimedia.org/T141450)
[18:28:23] (Draft2) Paladox: Contint: Make sure /mnt/home/jenkins-deploy/tmpfs is mounted before starting MySQL [puppet] - https://gerrit.wikimedia.org/r/328051 (https://phabricator.wikimedia.org/T141450)
[18:32:59] !log Testing
[18:33:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:34:15] mobrovac: i wonder if that wasn't a consequence of the node outages, that client library doesn't seem very robust in the face of lost connections
[18:34:49] ah right
[18:34:52] totally possible
[18:35:03] either that or gremlins
[18:35:28] * urandom lights a black candle
[18:35:51] haha
[18:37:13] urandom: trying to update https://commons.wikimedia.org/wiki/User%3AJ_budissin%2FUploads%2FBiH%2F2016_December_11-20 produces an error in cass
[18:37:35] which isn't a surprise seeing the size of the page
[18:38:59] (PS3) Paladox: Contint: Notify Service mysql to restart [puppet] - https://gerrit.wikimedia.org/r/328051 (https://phabricator.wikimedia.org/T141450)
[18:39:21] (PS4) Paladox: Contint: Notify Service mysql to restart [puppet] - https://gerrit.wikimedia.org/r/328051 (https://phabricator.wikimedia.org/T141450)
[18:40:28] mobrovac: timeout?
[18:41:43] not clear from the log i have, but smells like it
[18:41:48] chrome can't even load the page for me
[18:42:12] it's still trying here :)
[18:42:17] haha
[18:42:54] heh, i was just asked if i wanted to kill it or wait
[18:43:29] * urandom is in a killing mood
[18:43:42] i waited twice then gave up on it
[18:44:02] ok here's another candidate for blacklisting - https://commons.wikimedia.org/w/index.php?title=User:OgreBot/Uploads_by_new_users&action=history
[18:44:14] editing a huge log every hour
[18:46:10] * mobrovac is in a blacklisting mood
[18:46:22] whack-a-mole
[18:47:23] yup
[18:50:53] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds.
[18:51:44] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy
[19:12:23] PROBLEM - puppet last run on db1083 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:38:15] Operations, Wikimedia-Mailing-lists: Reach out to Google about @yahoo.com emails not reaching gmail inboxes (when sent to mailing lists) - https://phabricator.wikimedia.org/T146841#2885268 (Aklapper) >>! In T146841#2728793, @faidon wrote: > This is most likely related to Yahoo's DMARC policy, cf. T66818....
[19:40:23] RECOVERY - puppet last run on db1083 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures
[20:25:43] PROBLEM - puppet last run on sca1003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
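The two pages being discussed (a user's very large upload gallery and OgreBot's hourly-edited upload log) are too big to re-render before the request times out, hence the talk of blacklisting them. As a purely hypothetical illustration of that kind of filter, not RESTBase's actual blacklist mechanism, a title blacklist can be little more than a list of regexes consulted before a re-render is attempted; the patterns and function names below are invented for the example.

# hypothetical sketch of a re-render title blacklist; patterns and names are illustrative only
import re

RERENDER_BLACKLIST = [
    # drawn from the titles mentioned above (URL-style underscores assumed)
    re.compile(r"^User:J_budissin/Uploads/"),
    re.compile(r"^User:OgreBot/Uploads_by_new_users"),
]

def should_rerender(title):
    """Return False for titles known to blow the re-render timeout."""
    return not any(pattern.match(title) for pattern in RERENDER_BLACKLIST)

if __name__ == "__main__":
    print(should_rerender("User:OgreBot/Uploads_by_new_users"))  # False: skip it
    print(should_rerender("Main_Page"))                          # True: render as usual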
[20:39:53] PROBLEM - puppet last run on sca2004 is CRITICAL: CRITICAL: Puppet has 18 failures. Last run 2 minutes ago with 18 failures. Failed resources (up to 3 shown): Exec[ip addr add 2620:0:860:102:10:192:16:30/64 dev eth0],Service[ferm],Service[diamond],Service[prometheus-node-exporter]
[20:47:50] (PS1) Mobrovac: Conftool: Add restbase101[678] and restbase201[012] [puppet] - https://gerrit.wikimedia.org/r/328059 (https://phabricator.wikimedia.org/T151086)
[20:49:43] RECOVERY - puppet last run on sca1003 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[21:01:53] PROBLEM - puppet last run on sca2003 is CRITICAL: CRITICAL: Puppet has 27 failures. Last run 2 minutes ago with 27 failures. Failed resources (up to 3 shown): Exec[eth0_v6_token],Package[wipe],Package[zotero/translators],Package[zotero/translation-server]
[21:07:53] RECOVERY - puppet last run on sca2004 is OK: OK: Puppet is currently enabled, last run 20 seconds ago with 0 failures
[21:29:23] PROBLEM - puppet last run on wtp1005 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:29:53] RECOVERY - puppet last run on sca2003 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[21:57:23] RECOVERY - puppet last run on wtp1005 is OK: OK: Puppet is currently enabled, last run 6 seconds ago with 0 failures
[22:16:59] !log ariel@tin Starting deploy [dumps/dumps@2a35e23]: fix checkpoint prefetch jobs
[22:17:02] !log ariel@tin Finished deploy [dumps/dumps@2a35e23]: fix checkpoint prefetch jobs (duration: 00m 02s)
[22:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:29] Operations, GlobalRename, MediaWiki-extensions-CentralAuth: Rename user TextworkerBot to VladiBot - https://phabricator.wikimedia.org/T153602#2885442 (Peachey88)
[22:32:21] (PS1) ArielGlenn: if one wiki can't be monitored, don't except out, do the rest [dumps] - https://gerrit.wikimedia.org/r/328108
[22:33:14] (CR) ArielGlenn: [C: 2] if one wiki can't be monitored, don't except out, do the rest [dumps] - https://gerrit.wikimedia.org/r/328108 (owner: ArielGlenn)
[22:34:00] !log ariel@tin Starting deploy [dumps/dumps@92946f0]: make monitoring more robust
[22:34:02] !log ariel@tin Finished deploy [dumps/dumps@92946f0]: make monitoring more robust (duration: 00m 01s)
[22:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:34:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
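The final deploy ("make monitoring more robust", change 328108) is about the pattern its commit message names: if one wiki can't be monitored, skip it and continue with the rest instead of letting a single exception abort the whole run. The sketch below is a hypothetical illustration of that pattern, not the actual dumps monitoring code; check_wiki and the wiki list are invented stand-ins.

# hypothetical sketch of per-wiki error isolation; not the actual dumps monitoring code
import logging

def check_wiki(wiki):
    """Invented stand-in for whatever per-wiki status gathering the monitor does."""
    if wiki == "brokenwiki":
        raise RuntimeError("status file missing")
    return "ok"

def monitor_all(wikis):
    """Collect status for every wiki, skipping the ones that fail rather than bailing out."""
    results = {}
    for wiki in wikis:
        try:
            results[wiki] = check_wiki(wiki)
        except Exception:
            # one broken wiki should not stop the rest from being reported
            logging.exception("could not monitor %s, continuing with the rest", wiki)
    return results

if __name__ == "__main__":
    # 'brokenwiki' is logged and skipped; the other two still get reported
    print(monitor_all(["enwiki", "brokenwiki", "commonswiki"]))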